1 Abstract

This Rmarkdown provides the analysis for the processed data in the paper entitled “Multivalent Antigen Display on Nanoparticles Diversifies B Cell Responses”. It includes the B cell receptors analysis, plots, and generation of intermediate files for plotting using other programs. The processed files used here are AIRR-compliant datasets which can be downloaded from Zenodo (DOI: 00000000), they are either Human Respiratory Syncytial Virus-specific (RSV-specific) or the full bulk repertoire from the vaccinated non-human primates (Macaca mulatta).

2 Needed libraries

Loaded libraries need to be present.

library(jsonlite)
library(tidyr)
library(treeio)
library(ggtree)
library(rstatix)
library(ggpubr)
library(Biostrings)
library(data.table)
library(vegan)
library(ggplot2)
library(scales)
library(ggprism)
library(dplyr)
library(kableExtra)
source("df_to_fasta.R")
set.seed(123)

3 Load data

Some data used here is stored as .rds format, but it can also be found as .tsv at Zenodo (DOI: 000000000).

data_comp <- read.csv("../data/ELISA_comp/2023-03-06_normalized_plasma_compt.csv") %>%
  mutate(group = case_when(group == "NP 100%" ~ "20-mer",
                           group == "NP 50%" ~ "10-mer",
                           group == "Sol" ~ "1-mer",
                           group == "PostF" ~ "PostF"))

clonotype_rsv <- readRDS("../data/clonotypes/individualized/rsv-specific_clonotypes.rds")


# replace column names for AIRR-compliant names and filter out "PV" timepoint which was not used for the analysis
clonotype_rsv <- clonotype_rsv %>%
  filter(timepoint != "PV") %>%
  mutate(cdr3_aa_length = nchar(CDR3_aa),
         cdr3_aa = CDR3_aa,
         v_call = V_gene,
         j_call = J_gene,
         d_call = D_gene) 

# set color for fill and color aes layers
fill_col_values <- c("20-mer" = "#5F90B0", "10-mer" = "#92CDD5", "1-mer" = "#D896C1", "PostF" = "grey50")
color_values <- c("20-mer" = "#2E6997", "10-mer" = "#469698", "1-mer" = "#BE3C8C", "PostF" = "grey20")
 
# load repertoire sequencing reads info
rep_seq_ls <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = TRUE)
rep_seq_names <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = FALSE)
names(rep_seq_ls) <- sub("_st","",gsub("/","_",substr(rep_seq_names, 1,10)))

# load mAbs data
lor_mabs <- read.csv("../data/specificity/LOR_single-cells_characterized.csv")
mabs_lors <- read.csv("../data/single_cell/sc_summary_filtered.csv")

# load competition ELISA data
# edited the raw values to have the max value as the reference competition for ADI, MPE8, 101F, D25
data_comp_auc <- read.csv("../data/ELISA_comp/2023-04-28_LOR_norm-dAUC.csv", header = T, row.names = 1, encoding="UTF-8") %>%
  replace(is.na(.), 0) %>%
  mutate(specificity = factor(specificity, levels = c("PreF", "PreF/PostF", "PostF", "w.b.")),
         epitope = factor(epitope, levels = c("Ø","V", "IV","III", "II","I", "foldon", "UK-Pre","UK-DP","PostF", "WB")))

# load light chains information clonotyped
clono_light_chains <- read.table("../data/clonotypes/light_chain_assigned_clonotypes.tsv", sep = "\t", header = TRUE) 

4 Plasma profiling

The non-human primates plasma samples from this study were used for antibody competition ELISAs coated with pre-fusion or post-fusion proteins against previously characterized monoclonal antibodies. Here we show this data using different visualization methods.

4.1 Multidimensional scaling (MDS)

Multidimensional scaling aims to conserve distances between datapoints and/or samples from a set of variables. Thus, closer a point is to each other, closer their competition profile in this case. It is a good way to summarize in a 2D-space a large number of variables. The MDS input is a dissimilarity matrix, for this plot the input the matrix is based on cosine distance. Although the euclidean distance is the most used, we have decided to use the cosine distance here because the magnitude of responses are less important than the competition profile itself for us. Some NHPs were really good responders, thus having higher titers for all the competitions, thus euclidean distance would drive them far away from all the other NHPs because of their high antibody titer. Since cosine distance takes into account the distance as an angle rather than the value itself, it does not take into account the weight or the magnitude of the antibody titers.

cosine_distance <- function(x) {
#For cosine similarity matrix
Matrix <- x %>% 
  select(-c(timepoints, ID, group)) %>%
  as.matrix()
sim <- Matrix / sqrt(rowSums(Matrix * Matrix))
sim <- sim %*% t(sim)
#Convert to cosine dissimilarity matrix (distance matrix).
D_sim <- as.dist(1 - sim)

mds <- D_sim %>%       
  cmdscale(3) %>%
  as_tibble()
colnames(mds) <- c("Dim.1", "Dim.2", "Dim.3")

mds <- mds %>%
  mutate(group = x$group,
         timepoints = x$timepoints,
         ID = x$ID)
return(mds)
}

4.2 Plasma profiling with MDS divided by timepoint

The code below plots 4 different MDS plots. They are divided between 2 weeks after Boost 1 (B1) and Boost 2 (B2), with or without animals vaccinated with PostF immunogen. On the top of each plot, it is written B1 or B2, and legends say if PostF animals are included or not. The cosine distance was calculated separately using the custom cosine_distance() provided above, for that reason, position of points should not be compared directly between each plot but rather with points within each plot.

# calculate cosine distance and generate MDS with and without PostF
mds_b1 <- data_comp %>% filter(timepoints == "B1") %>% cosine_distance()
mds_b2 <- data_comp %>% filter(timepoints == "B2") %>% cosine_distance()
mds_no_postf_b1 <- data_comp %>% filter(group != "PostF", timepoints == "B1") %>% cosine_distance()
mds_no_postf_b2 <- data_comp %>% filter(group != "PostF", timepoints == "B2") %>% cosine_distance()
ls_mds <- list(mds_b1, mds_b2, mds_no_postf_b1, mds_no_postf_b2)

plot_list <- list()
for(f in seq_along(ls_mds)){
# Plot and color by groups
    mds_plot <- ls_mds[[f]]
    if(f %in% c(2,4)){
      mds_plot$Dim.1 <- mds_plot$Dim.1 * -1 # flip axis in first dimension
    }
    
    plot <- mds_plot %>%
      ggplot(aes(Dim.1, Dim.2, color = group, fill = group)) + 
      geom_point(size = 4, shape = 21) +
      scale_fill_manual(values = fill_col_values) +
      scale_color_manual(values = color_values) +
      lims(x = c(-.5,.5), y = c(-.4, .4)) + 
      ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
      theme(axis.ticks = element_line(size = .5),
            legend.position = c(.2,.2),
            legend.background = element_blank(),
            legend.box.background = element_rect(colour = "black")) +
      facet_wrap(~timepoints)
    
    plot_list[[f]] <- plot
    ggsave(plot = plot, filename = paste0("../", result.dir, f,"_mds_cosine-distance.pdf"), device = "pdf", width = 4, height = 4)
  }

ggarrange(plot_list[[3]], plot_list[[4]], plot_list[[1]], plot_list[[2]], ncol = 2, nrow = 2, common.legend = TRUE,legend = "top", legend.grob = get_legend(plot_list[[2]] + theme(legend.position = "top")))

4.3 Competition titers as bar plots

4.3.1 Competition for epitopes on Pref

data_comp_longer <- data_comp %>%
  pivot_longer(cols = 2:9, names_to = "mAb", values_to = "ELISA_competition") %>%
  mutate(epitopes = plyr::mapvalues(mAb, 
                                    from = c("D25.PreF", "MPE8.PreF", "ADI.PreF", "Pali.PreF", "X101F.PreF",
                                             "ADI.PostF", "X101F.PostF", "Pali.PostF"), 
                                    to = c("Ø", "III", "I", "II", "IV",
                                           "I", "IV", "II")),
         epitopes = factor(epitopes, levels = c("Ø", "III", "IV", "II", "I")),
         conformation = factor(case_when(grepl("PostF", mAb) ~ "PostF",
                                  grepl("PreF", mAb) ~ "PreF"), levels = c("PreF", "PostF")),
        timepoint_group = paste(timepoints, group, sep = "_"),
        timepoint_group_epitope = paste(timepoints, group, epitopes, sep = "_"))

my_comparisons_sites <- list(c("B1_1-mer", "B1_20-mer"),
                       c("B1_1-mer", "B1_10-mer"),
                       c("B1_1-mer", "B1_PostF"),
                       c("B1_10-mer", "B1_20-mer"),
                       c("B1_10-mer", "B1_PostF"),
                       c("B1_20-mer", "B1_PostF"),
                       c("B2_1-mer", "B2_20-mer"),
                       c("B2_1-mer", "B2_10-mer"),
                       c("B2_1-mer", "B2_PostF"),
                       c("B2_10-mer", "B2_20-mer"),
                       c("B2_10-mer", "B2_PostF"),
                       c("B2_20-mer", "B2_PostF"))

my_comparisons_1 <- lapply(my_comparisons_sites, paste0, "_I")
my_comparisons_2 <- lapply(my_comparisons_sites, paste0, "_II")
my_comparisons_3 <- lapply(my_comparisons_sites, paste0, "_III")
my_comparisons_4 <- lapply(my_comparisons_sites, paste0, "_IV")
my_comparisons_5 <- lapply(my_comparisons_sites, paste0, "_Ø")

# Selecting sites for statistical comparison only between groups for each timepoint
# removed statistical comparisons for Boost 1 site I in PreF since most values were below the threshold of detection
my_comparisons_preF <- c(my_comparisons_1[7:12], my_comparisons_2, my_comparisons_3, my_comparisons_4, my_comparisons_5)
my_comparisons_postF <- c(my_comparisons_1, my_comparisons_2, my_comparisons_4)
  

stat.test <- data_comp_longer %>%
  filter(conformation == "PreF") %>%
  wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope, 
              paired = FALSE, 
              p.adjust.method = "fdr", 
              comparisons = my_comparisons_preF)

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
ELISA_competition B2_1-mer_I B2_20-mer_I 4 5 14.0 0.413 0.812 ns
ELISA_competition B2_1-mer_I B2_10-mer_I 4 5 20.0 0.020 0.176 ns
ELISA_competition B2_1-mer_I B2_PostF_I 4 4 0.0 0.029 0.190 ns
ELISA_competition B2_10-mer_I B2_20-mer_I 5 5 6.0 0.205 0.527 ns
ELISA_competition B2_10-mer_I B2_PostF_I 5 4 0.0 0.020 0.176 ns
ELISA_competition B2_20-mer_I B2_PostF_I 5 4 2.0 0.064 0.343 ns
ELISA_competition B1_1-mer_II B1_20-mer_II 4 5 3.0 0.111 0.428 ns
ELISA_competition B1_1-mer_II B1_10-mer_II 4 5 8.0 0.730 0.939 ns
ELISA_competition B1_1-mer_II B1_PostF_II 4 4 12.0 0.343 0.772 ns
ELISA_competition B1_10-mer_II B1_20-mer_II 5 5 6.0 0.222 0.545 ns
ELISA_competition B1_10-mer_II B1_PostF_II 5 4 16.0 0.190 0.513 ns
ELISA_competition B1_20-mer_II B1_PostF_II 5 4 19.0 0.032 0.190 ns
ELISA_competition B2_1-mer_II B2_20-mer_II 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_II B2_10-mer_II 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_II B2_PostF_II 4 4 6.0 0.686 0.939 ns
ELISA_competition B2_10-mer_II B2_20-mer_II 5 5 9.0 0.548 0.883 ns
ELISA_competition B2_10-mer_II B2_PostF_II 5 4 6.0 0.413 0.812 ns
ELISA_competition B2_20-mer_II B2_PostF_II 5 4 9.0 0.905 0.958 ns
ELISA_competition B1_1-mer_III B1_20-mer_III 4 5 7.0 0.556 0.883 ns
ELISA_competition B1_1-mer_III B1_10-mer_III 4 5 8.0 0.730 0.939 ns
ELISA_competition B1_1-mer_III B1_PostF_III 4 4 13.5 0.124 0.446 ns
ELISA_competition B1_10-mer_III B1_20-mer_III 5 5 13.0 1.000 1.000 ns
ELISA_competition B1_10-mer_III B1_PostF_III 5 4 20.0 0.018 0.176 ns
ELISA_competition B1_20-mer_III B1_PostF_III 5 4 20.0 0.018 0.176 ns
ELISA_competition B2_1-mer_III B2_20-mer_III 4 5 4.0 0.190 0.513 ns
ELISA_competition B2_1-mer_III B2_10-mer_III 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_III B2_PostF_III 4 4 11.0 0.486 0.883 ns
ELISA_competition B2_10-mer_III B2_20-mer_III 5 5 8.0 0.421 0.812 ns
ELISA_competition B2_10-mer_III B2_PostF_III 5 4 11.0 0.905 0.958 ns
ELISA_competition B2_20-mer_III B2_PostF_III 5 4 16.0 0.190 0.513 ns
ELISA_competition B1_1-mer_IV B1_20-mer_IV 4 5 4.0 0.190 0.513 ns
ELISA_competition B1_1-mer_IV B1_10-mer_IV 4 5 6.0 0.413 0.812 ns
ELISA_competition B1_1-mer_IV B1_PostF_IV 4 4 8.0 1.000 1.000 ns
ELISA_competition B1_10-mer_IV B1_20-mer_IV 5 5 11.0 0.841 0.958 ns
ELISA_competition B1_10-mer_IV B1_PostF_IV 5 4 13.0 0.556 0.883 ns
ELISA_competition B1_20-mer_IV B1_PostF_IV 5 4 13.0 0.556 0.883 ns
ELISA_competition B2_1-mer_IV B2_20-mer_IV 4 5 8.0 0.730 0.939 ns
ELISA_competition B2_1-mer_IV B2_10-mer_IV 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_IV B2_PostF_IV 4 4 7.0 0.886 0.958 ns
ELISA_competition B2_10-mer_IV B2_20-mer_IV 5 5 4.0 0.095 0.428 ns
ELISA_competition B2_10-mer_IV B2_PostF_IV 5 4 5.0 0.286 0.671 ns
ELISA_competition B2_20-mer_IV B2_PostF_IV 5 4 12.0 0.730 0.939 ns
ELISA_competition B1_1-mer_Ø B1_20-mer_Ø 4 5 11.0 0.898 0.958 ns
ELISA_competition B1_1-mer_Ø B1_10-mer_Ø 4 5 12.0 0.701 0.939 ns
ELISA_competition B1_1-mer_Ø B1_PostF_Ø 4 4 12.0 0.186 0.513 ns
ELISA_competition B1_10-mer_Ø B1_20-mer_Ø 5 5 10.0 0.666 0.939 ns
ELISA_competition B1_10-mer_Ø B1_PostF_Ø 5 4 16.0 0.109 0.428 ns
ELISA_competition B1_20-mer_Ø B1_PostF_Ø 5 4 16.0 0.109 0.428 ns
ELISA_competition B2_1-mer_Ø B2_20-mer_Ø 4 5 13.0 0.556 0.883 ns
ELISA_competition B2_1-mer_Ø B2_10-mer_Ø 4 5 12.0 0.730 0.939 ns
ELISA_competition B2_1-mer_Ø B2_PostF_Ø 4 4 16.0 0.029 0.190 ns
ELISA_competition B2_10-mer_Ø B2_20-mer_Ø 5 5 13.0 1.000 1.000 ns
ELISA_competition B2_10-mer_Ø B2_PostF_Ø 5 4 20.0 0.020 0.176 ns
ELISA_competition B2_20-mer_Ø B2_PostF_Ø 5 4 20.0 0.020 0.176 ns
data_comp_longer %>%
  filter(conformation == "PreF") %>%
  ggplot(aes(x = timepoint_group,
            y = ELISA_competition,
            color = group,
            fill = group)) +
  geom_point(size = 2, shape = 21) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
  scale_y_log10() +
  stat_summary(fun.y = mean, 
               geom = "crossbar",
               color = "black") +
  geom_hline(yintercept = 10, linetype = "dashed") +
  labs(title = "PreF Antigenic Sites", y = "50% Competition Titer", x = "") +
  theme(axis.ticks = element_line(size = .5),
        legend.background = element_blank(),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
  facet_wrap(~epitopes, nrow = 1) 

4.3.2 Competition for epitopes on Pref

stat.test <- data_comp_longer %>%
  filter(conformation == "PostF") %>%
  wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope, 
              paired = FALSE, 
              p.adjust.method = "fdr", 
              comparisons = my_comparisons_postF) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
ELISA_competition B1_1-mer_I B1_20-mer_I 4 5 0.0 0.020 0.088 ns 6533.097 B1_1-mer…. 1 7
ELISA_competition B1_1-mer_I B1_10-mer_I 4 5 0.0 0.020 0.088 ns 7250.111 B1_1-mer…. 1 4
ELISA_competition B1_1-mer_I B1_PostF_I 4 4 1.0 0.059 0.127 ns 7967.125 B1_1-mer…. 1 10
ELISA_competition B1_10-mer_I B1_20-mer_I 5 5 3.0 0.056 0.127 ns 8684.139 B1_10-me…. 4 7
ELISA_competition B1_10-mer_I B1_PostF_I 5 4 9.0 0.905 0.905 ns 9401.153 B1_10-me…. 4 10
ELISA_competition B1_20-mer_I B1_PostF_I 5 4 12.0 0.730 0.773 ns 10118.168 B1_20-me…. 7 10
ELISA_competition B2_1-mer_I B2_20-mer_I 4 5 0.0 0.016 0.088 ns 10835.182 B2_1-mer…. 13 19
ELISA_competition B2_1-mer_I B2_10-mer_I 4 5 5.0 0.286 0.381 ns 11552.196 B2_1-mer…. 13 16
ELISA_competition B2_1-mer_I B2_PostF_I 4 4 0.0 0.029 0.088 ns 12269.210 B2_1-mer…. 13 22
ELISA_competition B2_10-mer_I B2_20-mer_I 5 5 7.0 0.310 0.385 ns 12986.224 B2_10-me…. 16 19
ELISA_competition B2_10-mer_I B2_PostF_I 5 4 0.0 0.016 0.088 ns 13703.238 B2_10-me…. 16 22
ELISA_competition B2_20-mer_I B2_PostF_I 5 4 3.0 0.111 0.195 ns 14420.252 B2_20-me…. 19 22
ELISA_competition B1_1-mer_II B1_20-mer_II 4 5 3.5 0.125 0.205 ns 15137.266 B1_1-mer…. 2 8
ELISA_competition B1_1-mer_II B1_10-mer_II 4 5 8.5 0.771 0.793 ns 15854.280 B1_1-mer…. 2 5
ELISA_competition B1_1-mer_II B1_PostF_II 4 4 0.0 0.026 0.088 ns 16571.294 B1_1-mer…. 2 11
ELISA_competition B1_10-mer_II B1_20-mer_II 5 5 6.5 0.236 0.327 ns 17288.309 B1_10-me…. 5 8
ELISA_competition B1_10-mer_II B1_PostF_II 5 4 1.0 0.034 0.088 ns 18005.323 B1_10-me…. 5 11
ELISA_competition B1_20-mer_II B1_PostF_II 5 4 2.0 0.064 0.127 ns 18722.337 B1_20-me…. 8 11
ELISA_competition B2_1-mer_II B2_20-mer_II 4 5 1.0 0.032 0.088 ns 19439.351 B2_1-mer…. 14 20
ELISA_competition B2_1-mer_II B2_10-mer_II 4 5 4.0 0.190 0.297 ns 20156.365 B2_1-mer…. 14 17
ELISA_competition B2_1-mer_II B2_PostF_II 4 4 0.0 0.029 0.088 ns 20873.379 B2_1-mer…. 14 23
ELISA_competition B2_10-mer_II B2_20-mer_II 5 5 8.0 0.421 0.474 ns 21590.393 B2_10-me…. 17 20
ELISA_competition B2_10-mer_II B2_PostF_II 5 4 0.0 0.016 0.088 ns 22307.407 B2_10-me…. 17 23
ELISA_competition B2_20-mer_II B2_PostF_II 5 4 2.0 0.064 0.127 ns 23024.421 B2_20-me…. 20 23
ELISA_competition B1_1-mer_IV B1_20-mer_IV 4 5 1.0 0.032 0.088 ns 23741.435 B1_1-mer…. 3 9
ELISA_competition B1_1-mer_IV B1_10-mer_IV 4 5 4.5 0.219 0.320 ns 24458.449 B1_1-mer…. 3 6
ELISA_competition B1_1-mer_IV B1_PostF_IV 4 4 2.0 0.114 0.195 ns 25175.464 B1_1-mer…. 3 12
ELISA_competition B1_10-mer_IV B1_20-mer_IV 5 5 6.0 0.222 0.320 ns 25892.478 B1_10-me…. 6 9
ELISA_competition B1_10-mer_IV B1_PostF_IV 5 4 6.0 0.413 0.474 ns 26609.492 B1_10-me…. 6 12
ELISA_competition B1_20-mer_IV B1_PostF_IV 5 4 13.0 0.556 0.607 ns 27326.506 B1_20-me…. 9 12
ELISA_competition B2_1-mer_IV B2_20-mer_IV 4 5 0.0 0.016 0.088 ns 28043.520 B2_1-mer…. 15 21
ELISA_competition B2_1-mer_IV B2_10-mer_IV 4 5 6.0 0.413 0.474 ns 28760.534 B2_1-mer…. 15 18
ELISA_competition B2_1-mer_IV B2_PostF_IV 4 4 0.0 0.029 0.088 ns 29477.548 B2_1-mer…. 15 24
ELISA_competition B2_10-mer_IV B2_20-mer_IV 5 5 7.0 0.310 0.385 ns 30194.562 B2_10-me…. 18 21
ELISA_competition B2_10-mer_IV B2_PostF_IV 5 4 0.0 0.016 0.088 ns 30911.576 B2_10-me…. 18 24
ELISA_competition B2_20-mer_IV B2_PostF_IV 5 4 3.0 0.111 0.195 ns 31628.590 B2_20-me…. 21 24
data_comp_longer %>%
  filter(conformation == "PostF") %>%
  ggplot(aes(x = timepoint_group,
            y = ELISA_competition,
            color = group,
            fill = group)) +
  geom_point(size = 2, shape = 21) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
  scale_y_log10() +
  stat_summary(fun.y = mean, 
               geom = "crossbar",
               color = "black") +
  geom_hline(yintercept = 10, linetype = "dashed") +
  labs(title = "PostF Antigenic Sites", y = "50% Competition Titer", x = "") +
  theme(axis.ticks = element_line(size = .5),
        legend.background = element_blank(),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
  facet_wrap(~epitopes, nrow = 1) 

5 Save metadata from B cell repertoire sequencing

Save table containing read information from the high-throughput sequencing. It includes number of raw reads per animal and number of processed reads with high quality assignment of HV and HJ genes using IgDiscover software as an IgBlast wrapper.

rep_seq_stat <-  lapply(rep_seq_ls, function(x) fromJSON(txt = x ))
rep_seq_stat <-  lapply(rep_seq_stat, as.data.frame)
rep_seq_stat <- data.table::rbindlist(rep_seq_stat, idcol = "ID", fill = TRUE) 
rep_seq_stat <- rep_seq_stat %>%
  select(c(ID,read_preprocessing.raw_reads, assignment_filtering.total, assignment_filtering.has_vj_assignment, 17:21))

write.table(rep_seq_stat, "../data/processed_data/summary_sequencing_table.tsv", row.names = FALSE, sep = "\t", quote = FALSE )

6 Comparing repertoires

6.1 Processing clonotype datasets

After merging the antigen-specific sequences and bulk sequencing, we run the the clonotype module from IgDiscover to assign clonotype grouping to each sequence. We have done that for both the individualized germline and the KIMDB database. Here we are reading those datasets

ls <- list.files("../data/clonotypes", recursive = T, full.names = T)
ls <- ls[grepl("rsv", ls) & grepl("individualized|nestor-rm", ls) ]
names(ls) <- basename(dirname(ls))
rds_merge <- lapply(ls, readRDS)
rds_merge <- rbindlist(rds_merge, idcol = TRUE, fill = TRUE)
rds_merge <- rds_merge %>%
  select(.id, specificity_group, sc_clone_grp, grp, new_name, ID_timepoint, V_SHM, V_errors, CDR3_aa, cdr3_aa, V_gene, J_gene, v_call, j_call) %>%
  mutate(
    Timepoint = factor(gsub(".*_", "", ID_timepoint), levels = c("PV", "B1", "B2", "Single-cell")),
    ID = gsub("_.*", "", ID_timepoint),
    database = plyr::mapvalues(.id,
      from = c("nestor-rm", "individualized"),
      to = c("KIMDB", "Individualized")
    ),
    database = factor(database, levels = c("KIMDB", "Individualized")),
    Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
    cdr3_aa_length = ifelse(is.na(CDR3_aa), nchar(cdr3_aa), nchar(CDR3_aa)),
    v_call = ifelse(is.na(v_call), V_gene, v_call),
    j_call = ifelse(is.na(j_call), J_gene, j_call)
  ) %>%
  group_by(.id, ID_timepoint) %>%
  distinct(new_name, .keep_all = TRUE) %>%
  ungroup() %>%
  filter(database %in% c("KIMDB", "Individualized"))

6.2 Plot number of sequences per animal

Plotting sequencing depth for each timepoint per animal. This indicates that the sequencing depth for Boost 1 was lower, thus for that reason all the analysis were normalized by sequencing depth to take that into account.

# samples from BR2-B2 from E17 were merged with TR1-B2
# this was done due to low sequencing depth for that animal and large expansion of uncharacterized clones 
full_rep_indiv <- read.table("../data/processed_data/summary_sequencing_table.tsv", 
                             header = TRUE) %>%
  separate(ID, sep = "_", into = c("sample", "ID")) %>%
  filter(sample != "TR2-B2") %>%
  mutate(sample = ifelse(sample == "BR2-B2", "TR1-B2", sample),
         timepoint = case_when(sample == "igm" ~ "PV",
                            grepl("B1", sample) ~ "B1",
                            grepl("B2", sample) ~ "B2"),
         timepoint = factor(timepoint, levels = c("PV", "B1", "B2"))) %>%
  group_by(timepoint, ID) %>%
  summarise(across(where(is.numeric), sum)) %>%
  ungroup()
  
# since here we wanted to compare all groups, we used Dunn's test
stat.test <- full_rep_indiv %>%
              dunn_test(formula = assignment_filtering.has_cdr3 ~ timepoint,
                          p.adjust.method = "fdr") %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
assignment_filtering.has_cdr3 PV B1 9 9 -3.6525703 0.0002596 0.0007789 *** 1393265 PV, B1 1 2
assignment_filtering.has_cdr3 PV B2 9 9 -0.2672612 0.7892680 0.7892680 ns 1491771 PV, B2 1 3
assignment_filtering.has_cdr3 B1 B2 9 9 3.3853091 0.0007110 0.0010665 ** 1590276 B1, B2 2 3
full_rep_indiv %>%
  ggdotplot(y = "assignment_filtering.has_cdr3", x = "timepoint", 
            group = "timepoint", 
            fill = "ID", 
            size = 1,) +
  geom_boxplot(alpha = .2, outlier.shape = NA) +
  ggpubr::stat_compare_means(method = "kruskal") +
  labs(x = "Timepoint", y = "# good quality aligned sequences") +
  scale_fill_viridis_d() +
  stat_pvalue_manual(stat.test %>% mutate(p.adj = round(p.adj, 4)), label = "p.adj") +
   ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = 1, axis_text_angle = 45, base_rect_size = 1.5)

ggsave(paste0("../", result.dir, "repertoire_depth.pdf"), width = 5, height = 3)

6.3 Generate data for clone size per animal per database

rds_summary <- rds_merge %>%
  group_by(database, ID_timepoint, grp) %>%
  summarise(
    ID = first(ID), Timepoint = first(Timepoint), Group = first(Group),
    clonal_size = n(),
    database,
    sc_clone_grp = first(sc_clone_grp),
    V_errors = mean(V_errors),
    specificity_group = first(specificity_group),
    cdr3_aa_length = mean(cdr3_aa_length),
    v_call = first(v_call),
    j_call = first(j_call)
  ) %>%
  ungroup() %>%
  group_by(database, ID_timepoint) %>%
  mutate(clonal_size_rank = dense_rank(dplyr::desc(clonal_size))) %>%
  ungroup() %>%
  distinct()

rds_summary_noPV <- rds_summary %>%
  filter(Timepoint != "PV", Timepoint != "Single-cell")

rds_summary_save <- rds_summary %>%
  filter(Timepoint %in% c("Single-cell", "B1","B2"),
         database == "Individualized") %>%
  group_by(ID_timepoint) %>%
  summarise(mean_clonal_size = mean(clonal_size),
            median_clonal_size = median(clonal_size),
            geom_clonal_size = exp(mean(log(clonal_size))),
            unique_clones = sum(clonal_size == 1),
            total_clones_detected = n(),
            percentage_unique_clones = round(unique_clones/total_clones_detected*100,digits = 2)) 

rds_summary_save %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
ID_timepoint mean_clonal_size median_clonal_size geom_clonal_size unique_clones total_clones_detected percentage_unique_clones
E11_B1 73.765766 24.0 23.618580 7 111 6.31
E11_B2 65.842324 18.0 16.139896 22 241 9.13
E11_Single-cell 1.596491 1.0 1.310902 169 228 74.12
E12_B1 96.802632 8.5 11.635387 5 76 6.58
E12_B2 62.696296 9.0 9.733452 18 135 13.33
E12_Single-cell 1.556000 1.0 1.304827 185 250 74.00
E14_B1 79.458333 9.0 9.683393 11 96 11.46
E14_B2 111.530612 10.5 12.335661 29 196 14.80
E14_Single-cell 2.041096 1.0 1.527244 139 219 63.47
E16_B1 73.535519 14.0 15.223779 14 183 7.65
E16_B2 118.491228 31.5 29.123260 17 228 7.46
E16_Single-cell 1.996094 1.0 1.489967 169 256 66.02
E17_B1 38.460526 8.0 9.753009 8 76 10.53
E17_B2 55.614035 6.0 7.902730 31 171 18.13
E17_Single-cell 1.665158 1.0 1.381309 151 221 68.33
E18_B1 41.323529 11.0 11.772174 9 102 8.82
E18_B2 72.662921 21.0 15.643111 23 178 12.92
E18_Single-cell 1.641026 1.0 1.368241 135 195 69.23
E21_B1 35.201835 5.0 6.887328 21 109 19.27
E21_B2 45.202312 14.0 13.455144 16 173 9.25
E21_Single-cell 1.674208 1.0 1.385378 151 221 68.33
E23_B1 59.675497 9.0 11.417886 19 151 12.58
E23_B2 63.044025 13.0 14.180265 18 159 11.32
E23_Single-cell 1.916279 1.0 1.484490 136 215 63.26
E24_B1 55.469231 10.0 11.115858 18 130 13.85
E24_B2 62.310735 9.0 10.429075 29 177 16.38
E24_Single-cell 1.829897 1.0 1.400170 137 194 70.62
write.csv(rds_summary_save, 
          paste0("../", result.dir, "clone_size_mean_median.csv"),row.names = FALSE)

6.4 Process and merge dataset with metadata

rds_indiv <- rds_summary %>%
  filter(
    database == "Individualized",
    Timepoint != "Single-cell",
    Timepoint != "PV"
  ) %>%
  mutate(
    LOR = ifelse(grepl(pattern = paste0(lor_mabs$well_ID, collapse = "|"), x = sc_clone_grp), "cloned", "not_cloned"),
    LOR = factor(LOR, levels = c("cloned", "not_cloned"), ordered = TRUE)
  )
rds_indiv$Timepoint <- droplevels(rds_indiv$Timepoint)
all <- rds_indiv %>%
  tidyr::expand(ID, Timepoint, grp) %>%
  filter(Timepoint != "Single-cell")

rds_indiv <- rds_indiv %>%
  right_join(all) %>%
  mutate(
    ID_timepoint = paste(ID, Timepoint, sep = "_"),
    database = "individualized",
    clonal_size = ifelse(is.na(clonal_size), 0, clonal_size)
  ) %>%
  arrange(grp, ID_timepoint) %>%
  tidyr::fill(LOR, Group, sc_clone_grp)

6.5 Cumulative plots

6.5.1 Clone size area plot (cumulative) of 20-mer and 1-mer, divided by specificity (DP & PreF)

rds_indiv %>%
  group_by(Group, Timepoint, specificity_group) %>%
  summarise(clonal_size = sum(clonal_size)) %>%
  tidyr::drop_na() %>%
  filter(specificity_group != "PostF") %>%
  ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
  geom_area(color = "black") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 80000)) +
  scale_x_discrete(expand = c(0.02, 0.02)) +
  scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
  labs(y = "Clonal size") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~Group)

ggsave(paste0("../", result.dir, "rsv_specific_clonal_size_area_plot_spec.pdf"), width = 8, height = 2)

6.5.2 Clone size area plot (cumulative) per animal, divided by specificity (DP & PreF)

rds_indiv %>%
  group_by(Group, ID, Timepoint, specificity_group) %>%
  summarise(clonal_size = sum(clonal_size)) %>%
  tidyr::drop_na() %>%
  filter(specificity_group != "PostF") %>%
  ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
  geom_area(color = "black") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 30000)) +
  scale_x_discrete(expand = c(0.02, 0.02)) +
  scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
  labs(y = "Clonal size") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~Group + ID, ncol = 5)

6.6 Correlation clone size per animal per database

df_all <- data.frame()
for (timepoints in unique(rds_summary_noPV$Timepoint)) {
  for (databases in unique(rds_summary_noPV$database)) {
    rds_timepoint <- rds_summary_noPV %>%
      filter(Timepoint == timepoints, database == databases) %>%
      arrange(desc(clonal_size)) %>%
      pull(clonal_size)

    rds_indiv_corr <- rds_summary_noPV %>%
      filter(
        database == "Individualized",
        Timepoint == timepoints
      ) %>%
      arrange(desc(clonal_size)) %>%
      pull(clonal_size)

    length(rds_indiv_corr) <- length(rds_timepoint)

    df <- data.frame(
      size_indiv = rds_indiv_corr,
      clonal_size = rds_timepoint,
      Timepoint = timepoints,
      database = databases
    )
    df[is.na(df)] <- 0
    df_all <- rbind(df, df_all)
  }
}


df_all %>%
  filter(database != "Individualized") %>%
  mutate(database = factor(database, levels = c("KIMDB", "Individualized"))) %>%
  ggplot(aes(x = size_indiv, y = clonal_size)) +
  geom_point(size = 1) +
  geom_abline(intercept = 0, slope = 1, color = "black", linetype = "dashed") +
  scale_x_continuous(
    trans = pseudo_log_trans(base = 10),
    breaks = c(1, 10, 100, 1000, 10000),
    labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
  ) +
  scale_y_continuous(
    trans = pseudo_log_trans(base = 10),
    breaks = c(1, 10, 100, 1000, 10000),
    labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
  ) +
  labs(y = "Clonal size\n (KIMDB)", x = "Clonal size\n(Individualized database)") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~ Timepoint)

ggsave(paste0("../", result.dir, "individualized_db_comparison.pdf"), width = 12, height = 6)

6.7 IGHV-J pairing per database

for(i in c("Individualized", "KIMDB")){
rds_summary_noPV <- rds_summary %>%
  # check if we want to filter clonotypes by size or not
  filter(database == i) %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  mutate(v_call = sub("\\*.*","", v_call),
         j_call = sub("\\*.*|-.*","", j_call),
         clonal_size_log = log10(clonal_size),
         Group = factor(Group, levels = c("20-mer", "1-mer")),
         specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP", "PostF"))) %>%
  group_by(Group) %>%
  mutate(clonal_size_perc = (clonal_size/sum(clonal_size)) * 100) %>%
  ungroup() %>% 
  group_by(Group, specificity_group, v_call, j_call) %>%
  summarise(clonal_size_perc = sum(clonal_size_perc)) %>%
  filter(specificity_group != "PostF") %>% droplevels()

plot_vj <-  rds_summary_noPV %>%
    ggplot(aes(x= v_call, y = j_call, fill = clonal_size_perc)) +
      geom_tile(color = "black") +
      scale_fill_viridis_c(option = "viridis", direction = 1) +
      scale_y_discrete(position = "right") +
      ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
      coord_equal() +
      theme(axis.text.x = element_text(angle = 90, size = 5, hjust = 1, vjust = 0.5, face = "bold", colour = "black"),
            axis.text.y = element_text(face = "bold", colour = "black", size = 5),
            strip.text = element_text(face = "bold", size = 7),
            panel.grid.major = element_blank(), 
            panel.grid.minor = element_blank(),
            legend.position =  "top",
            axis.ticks = element_line(size = .5),
            legend.title = element_text()) +
      labs(fill = "Sequence count\n(% RSV repertoire\n per group)", x = "IGHV alleles", y = "IGHJ alleles", title = paste0(i, " Database")) +
      facet_wrap(~ specificity_group + Group, ncol = 1, strip.position = "left")

print(plot_vj)

ggsave(plot = plot_vj, filename = paste0("../",result.dir,i,"_IGHV-IGHJ_pairing-rsv-specific-sequences_perc.pdf"), width = 9, height = 7)
}

6.8 Count unique V-J pairs

# groups to perform statistical comparisons
my_comparisons <- list(c("Total_B1_20-mer", "Total_B1_1-mer"),
                       c("Total_B2_20-mer", "Total_B2_1-mer"),
                       c("PreF_B1_20-mer", "PreF_B1_1-mer"),
                       c("PreF_B2_20-mer", "PreF_B2_1-mer"),
                       c("DP_B1_20-mer", "DP_B1_1-mer"),
                       c("DP_B2_20-mer", "DP_B2_1-mer"))

rds_summary_noPV <- rds_summary %>%
  # If want to remove clonal sizer < 1, do it here.
  filter(database == "Individualized", Timepoint != "Single-cell", Timepoint != "PV") %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
         Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(ID_timepoint, specificity_group, Group_specificity, Group) %>%
  mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
  distinct(v_j_calls) %>%
  summarise(unique_v_j = n()) %>%
  ungroup()

rds_summary_noPV %>% write.csv(paste0("../", result.dir,"unique_HV_HJ_pairing-data.csv"), row.names = F)

stat.test <- rds_summary_noPV %>%
              wilcox_test(formula = unique_v_j ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
unique_v_j Total_B1_20-mer Total_B1_1-mer 5 4 14 0.413 0.496 ns 137.040 Total_B1…. 1 2
unique_v_j Total_B2_20-mer Total_B2_1-mer 5 4 11 0.901 0.901 ns 150.288 Total_B2…. 3 4
unique_v_j PreF_B1_20-mer PreF_B1_1-mer 5 4 4 0.176 0.264 ns 163.536 PreF_B1_…. 5 6
unique_v_j PreF_B2_20-mer PreF_B2_1-mer 5 4 0 0.016 0.048
176.784 PreF_B2_…. 7 8
unique_v_j DP_B1_20-mer DP_B1_1-mer 5 4 20 0.016 0.048
190.032 DP_B1_20…. 9 10
unique_v_j DP_B2_20-mer DP_B2_1-mer 5 4 18 0.064 0.127 ns 203.280 DP_B2_20…. 11 12
rds_summary_noPV %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "unique_v_j", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 1) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 200)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "HV and HJ unique pairs\n(Count)", x = "") +
  stat_summary(fun.y = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 150) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "v-j-pair_count.pdf"), width = 5, height = 3)

6.9 VennDiagram of HV-HJ sharing

rds_summary_noPV <- rds_summary %>%
  # decide if clonal size greater than 1 should be included
  filter(database == "Individualized", Timepoint != "Single-cell") %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
         Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(Group) %>%
  mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
  distinct(v_j_calls) 

list_v_j <- list("20-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "20-mer"],
                 "1-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "1-mer"])

ggVennDiagram::ggVennDiagram(list_v_j, label_size = 7, 
                             label_alpha = 0,
                             set_color = "black",
                             set_size = 9) +
  scale_fill_gradient(low = "#F4FAFE", high = "#4981BF") +
  scale_color_manual(values = c("#5F90B0", "#D896C1")) +
  theme(legend.position = "none")

ggsave(paste0("../", result.dir,"shared-unique-v-j_pairing.pdf"), height = 5, width = 5)

7 Generate processed files for diversity index estimation

Intermediate files are commented out, so it will not save all of them. One could uncomment them if all the intermediate files are needed. Intermediate files were used to run Recon (v2.5) (Kaplinsky & Arnaout, Nat Commmun, 2016) according to default parameters, more about running Recon is described in the next section Recon estimates for RSV-specific diversity.

# add column for loop of ID, timepoint and specificity
clonotype_rsv <- clonotype_rsv %>%
  mutate(ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group))

# create empty lists for loop
filtered_animal_rsv <- list()
clonotype_summary_rsv <- list()

# loop to create summary and full clonotype files for saving and/or following analysis
for (animals in unique(clonotype_rsv$ID_timepoint_spec)) {
  # save full clonotype files
  filtered_animal_rsv[[animals]] <- clonotype_rsv %>%
    filter(ID_timepoint_spec == animals) %>%
    as.data.frame()
 # write.csv(filtered_animal_rsv[[animals]], paste0("../", result.dir, animals, "_clonotype_full.csv"), row.names = FALSE)

  # save summary files
  clonotype_summary_rsv[[animals]] <- filtered_animal_rsv[[animals]] %>%
    group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
    summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup()
  #write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
  }

# doing the same thing but for total, that means not account for specificities (PreF, DP,PostF)
for (animals in unique(clonotype_rsv$ID_timepoint)) {
  # save summary files
  clonotype_summary_rsv[[paste0(animals, "_total")]] <- clonotype_rsv %>%
    filter(ID_timepoint == animals) %>%
    as.data.frame() %>%
    mutate(
      specificity_group = "Total",
      ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group)
    ) %>%
    group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
    summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup()
 # write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
  }


# Save clonotype summary per animal, taking into account ID, timepoint and clonal group
filtered_animal_rsv_summary <- list()

for (animals in unique(clonotype_rsv$ID)) {
  # save summary files
  clonotype_rsv %>%
    filter(ID == animals) %>%
    as.data.frame() %>%
    group_by(grp, timepoint, ID) %>%
    summarise(clonal_size = n(), single_cells = first(sc_clone_grp), v_call = first(v_call), j_call = first(j_call), cdr3_aa = first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup() 
   # write.csv(paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
    }

to_recon_rsv <- data.table::rbindlist(clonotype_summary_rsv) %>%
  select(clonal_size, ID_timepoint_spec) %>%
  group_by(clonal_size, ID_timepoint_spec) %>%
  summarise(size = n()) %>%
  ungroup()


for (i in unique(to_recon_rsv$ID_timepoint_spec)) {
  to_recon_table <- to_recon_rsv %>%
    filter(ID_timepoint_spec == i) %>%
    select(clonal_size, size)

 # fwrite(to_recon_table, file = paste0("../", result.dir, i, "_rsv_file_to_recon.txt"), append = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}

7.1 Calculating Species richness

7.1.1 Chao1 estimates for RSV-specific diversity

Species richness (Chao1) was calculated using the vegan r package. The samples were first subsampled 100 times for each animal/timepoint and then the species richness was estimated. The mean of those 100x replicates were used for plotting.

Here is the processing and estimation of species richness:

rsv_chao <- lapply(clonotype_summary_rsv, function(x) select(x, "clonal_size"))


# subsample and replicate the subsampling 100 times for higher accuracy

chaox100 <- function(x, value_to_subsample) {
  replicate(100, {
    subsample <- vegan::rrarefy(x, value_to_subsample)
    chao <- vegan::estimateR(subsample)
    return(chao)
  })
}

diff_spec_timepoints <- unique(substring(names(clonotype_summary_rsv), 5))

# subsample based on lowest total clonotype size per group
chao_list_results <- list()
for (spec_timepoint in diff_spec_timepoints) {
  print(spec_timepoint)
  rsv_filtered <- rsv_chao[grepl(spec_timepoint, names(rsv_chao))]
  min_to_subsample <- min(unlist(lapply(rsv_filtered, colSums)))
  chao_list_results[[spec_timepoint]] <- lapply(rsv_filtered, chaox100, min_to_subsample)
}
## [1] "Single-cell_DP"
## [1] "B1_DP"
## [1] "B2_DP"
## [1] "Single-cell_PreF"
## [1] "B1_PreF"
## [1] "B2_PreF"
## [1] "Single-cell_PostF"
## [1] "B1_PostF"
## [1] "B2_PostF"
## [1] "Single-cell_total"
## [1] "B1_total"
## [1] "B2_total"
change_names <- function(x) {
  names(x) <- gsub("_.*", "", names(x))
  x
}

# adjusted dataset for plotting
{
  chao_results_df <- purrr::map(chao_list_results, ~ change_names(.x))
  chao_results_df <- rbindlist(chao_results_df, use.names = TRUE, idcol = TRUE, fill = TRUE)
  chao_results_df$algorithm <- rep(c("Obs", "Chao1", "Chao1_se", "ACE", "ACE_se"), nrow(chao_results_df) / 5)
  # save intermediate file
  chao_results_df %>%
    filter(algorithm %in% c("Chao1", "ACE")) %>%
    dplyr::rename(Timepoint_specificity = .id) %>%
    write.csv(paste0("../", result.dir, "rsv_repertoire_diversity.csv"), row.names = FALSE)
  # diversity mean of x100 replicates per animal
  chao_results_df %>%
    filter(algorithm %in% c("Chao1", "ACE")) %>%
    dplyr::rename(Timepoint_specificity = .id) %>%
    group_by(algorithm, Timepoint_specificity) %>%
    summarise_all(.funs = mean) %>%
    write.csv(paste0("../", result.dir, "rsv_repertoire_diversity_mean.csv"), row.names = FALSE)

  chao_results_df <- tidyr::pivot_longer(chao_results_df, cols = 2:(length(chao_results_df) - 1), names_to = c("ID")) %>%
    tidyr::separate(.id, c("Timepoint", "Specificity"), sep = "_") %>%
    mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ))
}

7.1.2 Chao1 plotting between groups for RSV-specific diversity

Here is the plotting and comparison between vaccinated group:

chao_results_df$Specificity[chao_results_df$Specificity == "total"] <- "Total"

chao_results_df <- chao_results_df %>%
  filter(algorithm != "Chao1_se" & algorithm != "ACE_se") %>%
  filter(algorithm == "Chao1", Timepoint != "PV", Timepoint != "Single-cell",) %>%
  mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
    Group_specificity = paste(Specificity, Timepoint, Group, sep = "_"),
    Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(ID, Group, Timepoint, Specificity, algorithm, Group_specificity) %>%
  summarise(value = mean(value)) %>%
  tidyr::drop_na() %>%
  ungroup()

write.csv(chao_results_df, paste0("../", result.dir, "chao1_results_plot_values.csv"), row.names = FALSE)


stat.test <- chao_results_df %>%
              wilcox_test(formula = value ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
value Total_B1_20-mer Total_B1_1-mer 5 4 16 0.190 0.285 ns 272.3352 Total_B1…. 1 2
value Total_B2_20-mer Total_B2_1-mer 5 4 15 0.286 0.343 ns 301.8278 Total_B2…. 3 4
value PreF_B1_20-mer PreF_B1_1-mer 5 4 7 0.556 0.556 ns 331.3205 PreF_B1_…. 5 6
value PreF_B2_20-mer PreF_B2_1-mer 5 4 2 0.064 0.127 ns 360.8131 PreF_B2_…. 7 8
value DP_B1_20-mer DP_B1_1-mer 5 4 20 0.016 0.048
390.3058 DP_B1_20…. 9 10
value DP_B2_20-mer DP_B2_1-mer 5 4 20 0.016 0.048
419.7984 DP_B2_20…. 11 12
chao_results_df %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "value", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 1) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "Species richness\n(Chao1)", x = "") +
  stat_summary(fun.y = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "chao1_species_richness.pdf"), width = 5, height = 3)

7.1.3 Recon estimates for RSV-specific diversity

According to Recon default settings and tutorial (check Recon manual), the count files generated previously were used to create the fitfiles.txt that were used as an input for generating the diversity_table.txt for all the samples.

To generate the the fitfile for each count file, the bash script used in a for loop was:

#!/bin/sh
set -euo pipefail
FILE=$1
python recon_v2.5.py -R -t 30 -c -o ${FILE}_fitfile.txt $FILE

python recon_v2.5.py -x --x_max 30 -o ${FILE}_plotfile.txt -b error_bar_parameters.txt ${FILE}_fitfile.txt

Then, each fit file was used for generating the diversity_table.txt with the command:

python recon_v2.5.py -v -D -b error_bar_parameters.txt -o output_diversity_table.txt *rsv_file_to_recon.txt_fitfile.txt

The results from diversity_table.txt for all the samples were used as input for plotting.

recon_res <- read.table("../data/diversity_index/recon/rsv_output_diversity_table.txt", header = TRUE) %>%
  filter(Timepoint != "PV", Timepoint != "Single-cell", Specificity != "PostF") %>%
  mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
   Group_specificity = paste(Specificity, Timepoint, Group, sep = "_")) %>%
  mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  ungroup()

write.csv(recon_res, paste0("../", result.dir, "recon_results_plot_values.csv"), row.names = FALSE)
  

stat.test <- recon_res %>%
              wilcox_test(formula = est_0.0D ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
est_0.0D Total_B1_20-mer Total_B1_1-mer 5 4 16 0.190 0.228 ns 275.7082 Total_B1…. 1 2
est_0.0D Total_B2_20-mer Total_B2_1-mer 5 4 16 0.190 0.228 ns 304.8120 Total_B2…. 3 4
est_0.0D PreF_B1_20-mer PreF_B1_1-mer 5 4 7 0.556 0.556 ns 333.9159 PreF_B1_…. 5 6
est_0.0D PreF_B2_20-mer PreF_B2_1-mer 5 4 2 0.064 0.127 ns 363.0197 PreF_B2_…. 7 8
est_0.0D DP_B1_20-mer DP_B1_1-mer 5 4 20 0.016 0.048
392.1236 DP_B1_20…. 9 10
est_0.0D DP_B2_20-mer DP_B2_1-mer 5 4 20 0.016 0.048
421.2274 DP_B2_20…. 11 12
recon_res %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "est_0.0D", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 1) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "Species richness\n(Recon)", x = "") +
  stat_summary(fun.y = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "recon_species_richness.pdf"), width = 5, height = 3)

7.2 Somatic hypermutation comparisons

7.2.1 Somatic hypermutation over time for individualized database

Here the SHM is shown for all the sequences for every animal combined. Data is divided by 20-mer, 1-mer, and specificities. T

rds_indiv_total <- rds_indiv %>%
  mutate(specificity_group = "Total")

rds_indiv <- rbind(rds_indiv, rds_indiv_total)


rds_indiv_plot <- rds_indiv %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_")) %>%
  mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>% ungroup() 

stat.test <- rds_indiv_plot %>%
              wilcox_test(formula = V_errors ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
V_errors Total_B1_20-mer Total_B1_1-mer 651 383 119282.5 0.246 0.492 ns 68.320 Total_B1…. 1 2
V_errors Total_B2_20-mer Total_B2_1-mer 976 682 349160.0 0.088 0.265 ns 73.504 Total_B2…. 3 4
V_errors PreF_B1_20-mer PreF_B1_1-mer 205 205 21279.5 0.824 0.841 ns 78.688 PreF_B1_…. 5 6
V_errors PreF_B2_20-mer PreF_B2_1-mer 346 389 69762.5 0.391 0.587 ns 83.872 PreF_B2_…. 7 8
V_errors DP_B1_20-mer DP_B1_1-mer 413 144 29401.0 0.841 0.841 ns 89.056 DP_B1_20…. 9 10
V_errors DP_B2_20-mer DP_B2_1-mer 583 250 79353.0 0.042 0.251 ns 94.240 DP_B2_20…. 11 12
rds_indiv_plot %>%
  ggpubr::ggviolin(x = "Group_specificity", y = "V_errors", fill = "Group_specificity", group = "Group_specificity", 
                    legend = "none") +
  geom_boxplot(outlier.shape = NA, width = 0.15, color = "black", alpha = 0.2)+
  geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 80)) +
  scale_shape_manual(values=NA)+
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
  labs(y = "# IGHV nucleotide mutations",  x= "") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 70) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none") 

ggsave(paste0("../", result.dir, "rsv_specific_SHM_per_group.pdf"), width = 6, height = 3)

7.2.2 Plot somatic hypermutation over time summarized by animal

Here the SHM data is summarized per macaque, so each dot represents the average SHM of all the antigen-specific sequences for one animal.

rds_indiv_plot_summ <- rds_indiv_plot %>%
  filter(Group_specificity != "PostF") %>%
  group_by(ID, Group_specificity) %>%
  summarise(avg_V_errors = mean(V_errors, na.rm = TRUE)) %>%
  ungroup() %>%
  tidyr::drop_na()

stat.test <- rds_indiv_plot_summ %>%
              wilcox_test(formula = avg_V_errors ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
avg_V_errors Total_B1_20-mer Total_B1_1-mer 5 4 4 0.190 0.826 ns 20.4600 Total_B1…. 1 2
avg_V_errors Total_B2_20-mer Total_B2_1-mer 5 4 12 0.730 0.876 ns 22.2204 Total_B2…. 3 4
avg_V_errors PreF_B1_20-mer PreF_B1_1-mer 5 4 11 0.905 0.905 ns 23.9808 PreF_B1_…. 5 6
avg_V_errors PreF_B2_20-mer PreF_B2_1-mer 5 4 12 0.730 0.876 ns 25.7412 PreF_B2_…. 7 8
avg_V_errors DP_B1_20-mer DP_B1_1-mer 5 4 5 0.286 0.826 ns 27.5016 DP_B1_20…. 9 10
avg_V_errors DP_B2_20-mer DP_B2_1-mer 5 4 14 0.413 0.826 ns 29.2620 DP_B2_20…. 11 12
rds_indiv_plot_summ %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "avg_V_errors", fill = "Group_specificity", group = "Group_specificity", 
                    legend = "none") +
  geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 25)) +
  scale_shape_manual(values=NA)+
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
  stat_summary(fun.y = mean, 
               geom = "crossbar") +
  labs(y = "# IGHV nucleotide mutations",  x= "") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 22) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none") 

7.3 Plot HCDR3 length over time for individualized database divided by 20-mer and 1-mer, and specificities

rds_indiv %>%
  filter(specificity_group != "PostF") %>%
  mutate(
    Group_specificity = paste(Group, specificity_group, sep = "_"),
    specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP")),
    Timepoint = factor(Timepoint, levels = c("B1", "B2"))
  ) %>%
  ggplot(aes(x = cdr3_aa_length, fill = Group)) +
  geom_density(alpha = .7) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = c("#5F90B0", "#D896C1")) +
  labs(y = "Density", x = "HCDR3 aa length") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~ specificity_group + Timepoint, ncol = 3, dir = "v")

ggsave(paste0("../", result.dir, "rsv_specific_CDR3_length.pdf"), width = 6, height = 3)

8 Comparison of discovered alleles (individualized database)

8.1 Plot V alleles unique and shared per animal, stacked plot

ls <- list.files("../data/databases/individualized", recursive = T, full.names = T, pattern = "V.fasta")
names(ls) <- basename(dirname(ls))
ls <- ls[-1] # remove combined V genes
individualized_dbs <- lapply(ls, Biostrings::readDNAStringSet)
size_indv_db <- data.frame(
  v_count = unlist(lapply(individualized_dbs, length)),
  type = "Shared"
) %>%
  tibble::rownames_to_column("ID")
individualized_dbs <- unlist(Biostrings::DNAStringSetList(individualized_dbs))
duplicated_names <- c(names(unique(individualized_dbs[length(individualized_dbs):1, ])), names(unique(individualized_dbs)))
individualized_dbs_sel <- individualized_dbs[setdiff(names(individualized_dbs), duplicated_names)]
size_unique_db <- data.frame(
  v_unique = table(substr(names(unique(individualized_dbs_sel)), 1, 3)),
  type = "Unique") %>%
  dplyr::rename(v_count = v_unique.Freq,
         ID = v_unique.Var1)


unique_v_indiv <- rbind(size_indv_db, size_unique_db) %>%
  add_row(ID = c("E11", "E24"), v_count = rep(0), type = rep("Unique")) %>%
  group_by(ID) %>%
  arrange(ID, .by_group = TRUE) %>%
  mutate(
    diff = v_count - lag(v_count, default = last(v_count)),
    diff = ifelse(diff < 0, v_count, diff)
  )
unique_v_indiv %>%
  write.csv(paste0("../",result.dir, "alleles_unique_shared_per_animal.csv"), row.names = FALSE)

unique_v_indiv %>%
  mutate(type = factor(type, levels = c("Unique", "Shared"))) %>%
  ggplot(aes(x = ID, y = diff, fill = type)) +
  geom_col(color = "black") +
  scale_fill_viridis_d(direction = -1) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
  scale_x_discrete(expand = c(0, 0)) +
  labs(y = "V alelle counts", x = "Animal ID") +
  theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

ggsave(paste0("../", result.dir, "unique_and_shared_alleles.pdf"), width = 8, height = 6.38)

8.2 Plot V alleles validated or not per animal, stacked plot

kimdb <- Biostrings::readDNAStringSet("../data/databases/nestor_rm/V.fasta")

joined_dbs <- c(kimdb, individualized_dbs_sel)
uniq_joined_dbs <- unique(joined_dbs)
uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]
## DNAStringSet object of length 15:
##      width seq                                              names               
##  [1]   296 CAGGTCCAGCTGGTGCAATCCGG...CCGTGTATTACTGTGCAAGAGA E12.IGHV1-NL_1*01...
##  [2]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E12.IGHV3-100*01
##  [3]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
##  [4]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
##  [5]   298 GAAGTGCAGCTGGTGGAGTCTGG...TTGTATTACTGTAGTAGAGAGA E14.IGHV3-NL_15*0...
##  ...   ... ...
## [11]   296 GAGGTGCAGCTGGCGGAGTCTGG...CCGTGTATTACTGTGCGAGAGA E16.IGHV3-87*02
## [12]   299 CAGGTACAGCTGAAGGAGTCAGG...CCGTGTATTACTGTGCGAGACA E16.IGHV4-NL_23*0...
## [13]   296 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGAGA E18.IGHV4-NL_5*01...
## [14]   299 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGACA E18.IGHV4-NL_33*0...
## [15]   297 GAAGTGCAGCTGGTGGAGTCTGG...CTTGTATTACTGTGCAAAGATA E21.IGHV3-141*01
size_uniq_kimdb <- data.frame(
  v_unique = table(substr(names(uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]), 1, 3)),
  type = "Not validated"
) %>%
  dplyr::rename(
    v_count = v_unique.Freq,
    ID = v_unique.Var1
  )
size_indv_db$type <- "KIMDB"
kimdb_v_indiv <- rbind(size_indv_db, size_uniq_kimdb) %>%
  add_row(ID = c("E11", "E17", "E23", "E24"), v_count = rep(0), type = rep("Not validated")) %>%
  group_by(ID) %>%
  arrange(ID, .by_group = TRUE) %>%
  mutate(
    diff = v_count - lag(v_count, default = last(v_count)),
    diff = ifelse(diff < 0, v_count, diff)
  )

kimdb_v_indiv %>%
  write.csv(paste0("../",result.dir, "alleles_validated_KIMDB.csv"), row.names = FALSE)

kimdb_v_indiv %>%
  mutate(type = factor(type, levels = c("Not validated", "KIMDB"))) %>%
  ggplot(aes(x = ID, y = diff, fill = type)) +
  geom_col(color = "black") +
  scale_fill_viridis_d(direction = -1) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
  scale_x_discrete(expand = c(0, 0)) +
  labs(y = "V alelle counts", x = "Animal ID") +
  theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

ggsave(paste0("../", result.dir, "indiv_validated_KIMDB_alleles.pdf"), width = 8, height = 6.38)

9 Phylogenetic tree LOR21 and LOR24

9.1 Processing data

clonal_tree_data <- clonotype_rsv %>%
  ungroup() %>%
  distinct(new_name, .keep_all = TRUE) %>%
  group_by(specificity_group, sc_clone_grp, grp, ID_timepoint) %>%
  add_tally(name = "clonal_size") %>%
  ungroup()

mabs_of_interests <- c("LOR21" = "E16_05_A08", "LOR24" = "E16_05_H03", "LOR74" = "E11_05_B06")

# read data from heavy_chain
V_genes <- readDNAStringSet("../data/databases/individualized/combined/V.fasta")
J_genes <- readDNAStringSet("../data/databases/individualized/combined/J.fasta")
HC_all_filtered <- clonotype_rsv %>% filter(timepoint == "Single-cell")

# read data from light chain
LV_genes <- readDNAStringSet("../data/databases/cirelli_LC/V.fasta")
LJ_genes <- readDNAStringSet("../data/databases/cirelli_LC/J.fasta")

9.2 Generating fasta files for tree plotting

9.3 Heavy chain

sc_seq_count <- list()
for(i in seq_along(mabs_of_interests)){
  print(i)
  mab <- mabs_of_interests[i]
  
  if(any(stringr::str_count(names(mab), "LOR") <= 1)){
    mab <- mabs_of_interests[i]
    mab_name <- names(mab)
  }
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
    mab_name <- names(mab)}
  # create UCA heavy chain
  UCA_V <- V_genes[HC_all_filtered$V_gene[HC_all_filtered$name %in% mab]]
  UCA_J <- J_genes[HC_all_filtered$J_gene[HC_all_filtered$name %in% mab]]
  UCA <- DNAStringSet(paste0(UCA_V,UCA_J))
  names(UCA) <- paste0(mab_name,"_UCA")
  # select sequences part of the same clonotype
  group <- clonal_tree_data %>% filter(stringr::str_detect(sc_clone_grp, mab)) %>% select(grp) %>% unique() %>% pull()
  filtered <- clonal_tree_data %>%
    filter(grp == group) 
 
  # save csv with b cell lineage lineages info
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    write.csv(filtered, paste0("../",result.dir,mab_name[1],".csv"),
            row.names = FALSE)
  }else{
  write.csv(filtered, paste0("../",result.dir,mab_name,".csv"),
            row.names = FALSE)
  }
  # deduplicating sequences
  fasta <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
  
  fasta_lc <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
  
  #adding single cell sequences
  sc_seqs <- filtered[!duplicated(filtered[,c('sc_clone_grp','VDJ_nt')]),]
  
  #saving duplicated
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    sc_seq_count[[mab_name[1]]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }else{
    sc_seq_count[[mab_name]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }
    sc_seqs <- df_to_fasta(sequence_strings = sc_seqs$VDJ_nt, sequence_name = gsub(":", "_",sc_seqs$new_name), save_fasta = FALSE)
  
  fasta <- c(fasta, sc_seqs, UCA)
  fasta <- fasta[unique(names(fasta))]
  
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]], ".fasta"), append = FALSE, format = "fasta")
    }else{
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, ".fasta"), append = FALSE, format = "fasta") }
 
  
}
## [1] 1
## [1] 2
## [1] 3

9.4 Light chain

for(i in seq_along(mabs_of_interests)){
  print(i)
  mab <- mabs_of_interests[i]
  
  if(any(stringr::str_count(names(mab), "LOR") <= 1)){
    mab <- mabs_of_interests[i]
    mab_name <- names(mab)
  }
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
    mab_name <- names(mab)}
  # create UCA light chain
  UCA_LV <- LV_genes[clono_light_chains$v_call[clono_light_chains$name %in% mab]]
  UCA_LJ <- LJ_genes[clono_light_chains$j_call[clono_light_chains$name %in% mab]]
  UCA_LC <- DNAStringSet(paste0(UCA_LV,UCA_LJ))
  names(UCA_LC) <- paste0(mab_name,"_LC_UCA")
  # select light chain clonotypes
  group <- clono_light_chains %>% filter(stringr::str_detect(name, mab)) %>% select(grp) %>% unique() %>% pull()
  filtered <- clono_light_chains %>%
    filter(grp == group, substr(name,1,3) == substr(mab,1,3)) 
 
  # save csv with b cell lineage lineages info
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    write.csv(filtered, paste0("../",result.dir, mab_name[1],"_LC", ".csv"),
            row.names = FALSE)
  }else{
  write.csv(filtered, paste0("../",result.dir,mab_name,"_LC",".csv"),
            row.names = FALSE)
  }
  # save object as BStringset
  fasta <- df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE)
  
  # save duplicated values
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    sc_seq_count[[paste0(mab_name[1],"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }else{
    sc_seq_count[[paste0(mab_name,"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }
  
  fasta <- c(fasta, UCA_LC)
  fasta <- fasta[names(fasta)]
  
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    writeXStringSet(fasta, filepath = paste0("../",result.dir,  mab_name[[1]],"_LC", ".fasta"), append = FALSE, format = "fasta")
    }else{
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, "_LC", ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
## [1] 3

9.5 Run MUSCLE in bash

Do a Multiple Sequence Alignment using MUSCLE (v 5.1) for all the fasta files generated through out this analysis. Following that, run FastTree (v 2.1.11) for all the aligned sequences and save tree output to be plotted on the following code.


source ~/.bash_profile
DIR_DATE=$(date +'%Y-%m-%d')
cd ../results/$DIR_DATE/    
for f in *.fasta; do muscle -align $f -output $f.aln; done

for f in *.aln; do fasttree -nt -gtr $f > $f.tre; done
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 217 seqs, avg length 364, max 370
## 
## 00:00 4.5Mb  CPU has 8 cores, running 8 threads
## 00:00 4.7Mb   0.0043% Calc posteriors
00:01 181Mb     1.1% Calc posteriors 
00:02 203Mb     2.1% Calc posteriors
00:03 236Mb     3.3% Calc posteriors
00:04 275Mb     4.3% Calc posteriors
00:05 296Mb     5.4% Calc posteriors
00:06 319Mb     6.4% Calc posteriors
00:07 329Mb     7.1% Calc posteriors
00:08 355Mb     8.1% Calc posteriors
00:09 376Mb     9.0% Calc posteriors
00:10 399Mb    10.0% Calc posteriors
00:11 404Mb    11.1% Calc posteriors
00:12 417Mb    12.2% Calc posteriors
00:13 438Mb    13.3% Calc posteriors
00:14 483Mb    14.3% Calc posteriors
00:15 508Mb    15.3% Calc posteriors
00:16 526Mb    16.4% Calc posteriors
00:17 565Mb    17.3% Calc posteriors
00:18 616Mb    18.3% Calc posteriors
00:19 652Mb    19.2% Calc posteriors
00:20 683Mb    20.2% Calc posteriors
00:21 692Mb    21.3% Calc posteriors
00:22 729Mb    22.3% Calc posteriors
00:23 777Mb    23.3% Calc posteriors
00:24 795Mb    24.4% Calc posteriors
00:25 824Mb    25.4% Calc posteriors
00:26 839Mb    26.4% Calc posteriors
00:27 846Mb    27.4% Calc posteriors
00:28 861Mb    28.4% Calc posteriors
00:29 868Mb    29.4% Calc posteriors
00:30 881Mb    30.4% Calc posteriors
00:31 903Mb    31.4% Calc posteriors
00:32 931Mb    32.4% Calc posteriors
00:33 943Mb    33.4% Calc posteriors
00:34 847Mb    34.4% Calc posteriors
00:35 874Mb    35.4% Calc posteriors
00:36 876Mb    36.4% Calc posteriors
00:37 884Mb    37.5% Calc posteriors
00:38 897Mb    38.5% Calc posteriors
00:39 828Mb    39.5% Calc posteriors
00:40 832Mb    40.6% Calc posteriors
00:41 845Mb    41.6% Calc posteriors
00:42 847Mb    42.7% Calc posteriors
00:43 868Mb    43.7% Calc posteriors
00:44 891Mb    44.7% Calc posteriors
00:45 911Mb    45.8% Calc posteriors
00:46 927Mb    46.8% Calc posteriors
00:47 951Mb    47.8% Calc posteriors
00:48 994Mb    48.9% Calc posteriors
00:49 1.0Gb    49.9% Calc posteriors
00:50 1.0Gb    50.9% Calc posteriors
00:51 1.0Gb    52.0% Calc posteriors
00:52 1.1Gb    53.0% Calc posteriors
00:53 1.1Gb    53.8% Calc posteriors
00:54 1.1Gb    54.5% Calc posteriors
00:55 1.1Gb    55.0% Calc posteriors
00:56 1.1Gb    55.9% Calc posteriors
00:57 1.2Gb    56.8% Calc posteriors
00:58 1.2Gb    57.8% Calc posteriors
00:59 1.1Gb    58.8% Calc posteriors
01:00 1.1Gb    59.7% Calc posteriors
01:01 1.1Gb    60.5% Calc posteriors
01:02 1.1Gb    61.2% Calc posteriors
01:03 991Mb    61.8% Calc posteriors
01:04 998Mb    62.4% Calc posteriors
01:05 1.0Gb    63.0% Calc posteriors
01:06 1.0Gb    63.6% Calc posteriors
01:07 1.0Gb    64.1% Calc posteriors
01:08 1.0Gb    64.7% Calc posteriors
01:09 1.0Gb    65.2% Calc posteriors
01:10 1.0Gb    65.8% Calc posteriors
01:11 1.0Gb    66.4% Calc posteriors
01:12 1.1Gb    66.9% Calc posteriors
01:13 1.1Gb    67.5% Calc posteriors
01:14 1.1Gb    68.1% Calc posteriors
01:15 1.1Gb    68.7% Calc posteriors
01:16 1.1Gb    69.3% Calc posteriors
01:17 1.1Gb    70.0% Calc posteriors
01:18 1.1Gb    70.7% Calc posteriors
01:19 1.1Gb    71.4% Calc posteriors
01:20 1.1Gb    72.1% Calc posteriors
01:21 1.0Gb    73.0% Calc posteriors
01:22 1.0Gb    73.8% Calc posteriors
01:23 1.1Gb    74.7% Calc posteriors
01:24 1.1Gb    75.7% Calc posteriors
01:25 1.1Gb    76.5% Calc posteriors
01:26 1.1Gb    77.5% Calc posteriors
01:27 1.1Gb    78.3% Calc posteriors
01:28 1.1Gb    79.2% Calc posteriors
01:29 1.2Gb    80.0% Calc posteriors
01:30 1.2Gb    80.8% Calc posteriors
01:31 1.2Gb    81.6% Calc posteriors
01:32 1.2Gb    82.4% Calc posteriors
01:33 1.2Gb    83.2% Calc posteriors
01:34 1.2Gb    84.0% Calc posteriors
01:35 1.2Gb    84.7% Calc posteriors
01:36 1.1Gb    85.5% Calc posteriors
01:37 1.1Gb    86.3% Calc posteriors
01:38 1.1Gb    87.0% Calc posteriors
01:39 1.1Gb    87.8% Calc posteriors
01:40 1.1Gb    88.6% Calc posteriors
01:41 1.1Gb    89.4% Calc posteriors
01:42 1.2Gb    90.3% Calc posteriors
01:43 1.2Gb    91.0% Calc posteriors
01:44 1.0Gb    91.7% Calc posteriors
01:45 1.1Gb    92.5% Calc posteriors
01:46 1.1Gb    93.3% Calc posteriors
01:47 1.1Gb    94.1% Calc posteriors
01:48 1.1Gb    94.8% Calc posteriors
01:49 1.1Gb    95.7% Calc posteriors
01:50 1.1Gb    96.4% Calc posteriors
01:51 1.1Gb    97.3% Calc posteriors
01:52 1.2Gb    98.1% Calc posteriors
01:53 1.2Gb    99.0% Calc posteriors
01:54 1.2Gb    99.8% Calc posteriors
01:54 1.2Gb   100.0% Calc posteriors
## 01:54 1.2Gb   0.0043% Consistency (1/2)
01:55 1.2Gb     7.4% Consistency (1/2) 
01:56 1.2Gb    21.8% Consistency (1/2)
01:57 1.3Gb    36.8% Consistency (1/2)
01:58 1.3Gb    51.0% Consistency (1/2)
01:59 1.3Gb    65.9% Consistency (1/2)
02:00 1.3Gb    80.4% Consistency (1/2)
02:01 1.4Gb    94.9% Consistency (1/2)
02:01 1.4Gb   100.0% Consistency (1/2)
## 02:01 1.4Gb   0.0043% Consistency (2/2)
02:02 1.4Gb     4.2% Consistency (2/2) 
02:03 1.4Gb    17.8% Consistency (2/2)
02:04 1.4Gb    31.7% Consistency (2/2)
02:05 1.4Gb    44.5% Consistency (2/2)
02:06 1.4Gb    55.6% Consistency (2/2)
02:07 1.4Gb    68.1% Consistency (2/2)
02:08 1.4Gb    78.5% Consistency (2/2)
02:09 1.4Gb    89.8% Consistency (2/2)
02:09 1.4Gb   100.0% Consistency (2/2)
## 02:09 1.4Gb    0.46% UPGMA5           
02:09 1.4Gb   100.0% UPGMA5
## 02:10 1.4Gb     1.0% Refining
02:11 1.4Gb    20.0% Refining
02:12 1.4Gb    44.0% Refining
02:13 1.4Gb    68.0% Refining
02:14 1.4Gb    85.0% Refining
02:14 1.4Gb   100.0% Refining
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 8 seqs, avg length 322, max 326
## 
## 00:00 2.1Mb  CPU has 8 cores, running 8 threads
## 00:00 2.3Mb     3.6% Calc posteriors
00:01 68Mb     89.3% Calc posteriors
00:01 70Mb    100.0% Calc posteriors
## 00:01 73Mb      3.6% Consistency (1/2)
00:01 73Mb    100.0% Consistency (1/2)
## 00:01 73Mb      3.6% Consistency (2/2)
00:01 73Mb    100.0% Consistency (2/2)
## 00:01 73Mb     14.3% UPGMA5           
00:01 73Mb    100.0% UPGMA5
## 00:01 73Mb      1.0% Refining
00:01 74Mb    100.0% Refining
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 118 seqs, avg length 370, max 370
## 
## 00:00 2.9Mb  CPU has 8 cores, running 8 threads
## 00:00 3.1Mb   0.014% Calc posteriors
00:01 180Mb     2.9% Calc posteriors
00:02 192Mb     5.3% Calc posteriors
00:03 208Mb     7.5% Calc posteriors
00:04 211Mb     9.5% Calc posteriors
00:05 226Mb    12.0% Calc posteriors
00:06 249Mb    14.9% Calc posteriors
00:07 261Mb    17.8% Calc posteriors
00:08 279Mb    21.1% Calc posteriors
00:09 282Mb    24.3% Calc posteriors
00:10 294Mb    27.5% Calc posteriors
00:11 312Mb    30.5% Calc posteriors
00:12 323Mb    33.4% Calc posteriors
00:13 344Mb    36.3% Calc posteriors
00:14 346Mb    39.1% Calc posteriors
00:15 356Mb    41.9% Calc posteriors
00:16 368Mb    44.7% Calc posteriors
00:17 398Mb    47.4% Calc posteriors
00:18 416Mb    50.0% Calc posteriors
00:19 430Mb    52.5% Calc posteriors
00:20 442Mb    55.0% Calc posteriors
00:21 459Mb    57.6% Calc posteriors
00:22 482Mb    60.1% Calc posteriors
00:23 493Mb    62.8% Calc posteriors
00:24 517Mb    65.4% Calc posteriors
00:25 562Mb    68.1% Calc posteriors
00:26 617Mb    70.7% Calc posteriors
00:27 630Mb    72.6% Calc posteriors
00:28 660Mb    75.5% Calc posteriors
00:29 676Mb    78.4% Calc posteriors
00:30 694Mb    81.4% Calc posteriors
00:31 715Mb    84.4% Calc posteriors
00:32 740Mb    87.4% Calc posteriors
00:33 749Mb    90.2% Calc posteriors
00:34 754Mb    93.2% Calc posteriors
00:35 764Mb    96.1% Calc posteriors
00:36 790Mb    99.1% Calc posteriors
00:36 801Mb   100.0% Calc posteriors
## 00:36 801Mb   0.014% Consistency (1/2)
00:37 828Mb    53.2% Consistency (1/2)
00:37 852Mb   100.0% Consistency (1/2)
## 00:37 852Mb   0.014% Consistency (2/2)
00:38 853Mb    44.4% Consistency (2/2)
00:38 853Mb   100.0% Consistency (2/2)
## 00:38 853Mb    0.85% UPGMA5           
00:38 853Mb   100.0% UPGMA5
## 00:38 854Mb     1.0% Refining
00:39 855Mb    21.0% Refining
00:39 860Mb   100.0% Refining
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 10 seqs, avg length 331, max 334
## 
## 00:00 2.1Mb  CPU has 8 cores, running 8 threads
## 00:00 2.3Mb     2.2% Calc posteriors
00:01 78Mb     68.9% Calc posteriors
00:01 86Mb    100.0% Calc posteriors
## 00:01 88Mb      2.2% Consistency (1/2)
00:01 89Mb    100.0% Consistency (1/2)
## 00:01 89Mb      2.2% Consistency (2/2)
00:01 89Mb    100.0% Consistency (2/2)
## 00:01 89Mb     11.1% UPGMA5           
00:01 89Mb    100.0% UPGMA5
## 00:01 89Mb      1.0% Refining
00:01 89Mb    100.0% Refining
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 71 seqs, avg length 370, max 370
## 
## 00:00 2.4Mb  CPU has 8 cores, running 8 threads
## 00:00 2.6Mb    0.04% Calc posteriors
00:01 170Mb     7.4% Calc posteriors
00:02 225Mb    15.6% Calc posteriors
00:03 247Mb    23.4% Calc posteriors
00:04 262Mb    31.7% Calc posteriors
00:05 270Mb    39.8% Calc posteriors
00:06 280Mb    46.8% Calc posteriors
00:07 301Mb    54.6% Calc posteriors
00:08 323Mb    62.8% Calc posteriors
00:09 335Mb    70.7% Calc posteriors
00:10 345Mb    78.7% Calc posteriors
00:11 364Mb    86.7% Calc posteriors
00:12 379Mb    94.2% Calc posteriors
00:12 382Mb   100.0% Calc posteriors
## 00:12 382Mb    0.04% Consistency (1/2)
00:13 393Mb    53.5% Consistency (1/2)
00:13 404Mb   100.0% Consistency (1/2)
## 00:13 404Mb    0.04% Consistency (2/2)
00:13 407Mb   100.0% Consistency (2/2)
## 00:13 407Mb     1.4% UPGMA5           
00:13 407Mb   100.0% UPGMA5
## 00:13 408Mb     1.0% Refining
00:13 412Mb   100.0% Refining
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 3 seqs, avg length 332, max 334
## 
## 00:00 2.1Mb  CPU has 8 cores, running 8 threads
## 00:00 2.3Mb    33.3% Calc posteriors
00:00 2.6Mb   100.0% Calc posteriors
## 00:00 17Mb     33.3% Consistency (1/2)
00:00 17Mb    100.0% Consistency (1/2)
## 00:00 18Mb     33.3% Consistency (2/2)
00:00 18Mb    100.0% Consistency (2/2)
## 00:00 18Mb     50.0% UPGMA5           
00:00 18Mb    100.0% UPGMA5
## 00:00 19Mb      1.0% Refining
00:01 19Mb      5.0% Refining
00:01 19Mb    100.0% Refining
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.07 seconds
## Refining topology: 31 rounds ME-NNIs, 2 rounds ME-SPRs, 16 rounds ML-NNIs
##       0.13 seconds: SPR round   1 of   2, 101 of 432 nodes
##       0.27 seconds: SPR round   1 of   2, 401 of 432 nodes
##       0.38 seconds: SPR round   2 of   2, 201 of 432 nodes
##       0.48 seconds: ME NNI round 21 of 31, 1 of 215 splits
## Total branch-length 2.197 after 0.50 sec
##       0.61 seconds: ML NNI round 1 of 16, 101 of 215 splits, 27 changes (max delta 1.450)
## ML-NNI round 1: LogLk = -5720.973 NNIs 57 max delta 7.09 Time 0.69
##       0.74 seconds: Optimizing GTR model, step 2 of 12
##       0.85 seconds: Optimizing GTR model, step 4 of 12
##       0.98 seconds: Optimizing GTR model, step 6 of 12
##       1.09 seconds: Optimizing GTR model, step 9 of 12
##       1.19 seconds: Optimizing GTR model, step 12 of 12
## GTR Frequencies: 0.2176 0.2444 0.3116 0.2264
## GTR rates(ac ag at cg ct gt) 1.2448 1.4803 0.7188 0.5852 1.4823 1.0000
##       1.29 seconds: Site likelihoods with rate category 3 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.816 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
##       1.49 seconds: ML NNI round 2 of 16, 101 of 215 splits, 38 changes (max delta 0.001)
##       1.62 seconds: ML NNI round 2 of 16, 201 of 215 splits, 63 changes (max delta 0.001)
## ML-NNI round 2: LogLk = -5299.576 NNIs 67 max delta 0.00 Time 1.66
## Turning off heuristics for final round of ML NNIs (converged)
##       1.82 seconds: ML NNI round 3 of 16, 101 of 215 splits, 40 changes (max delta 0.000)
##       1.95 seconds: ML NNI round 3 of 16, 201 of 215 splits, 57 changes (max delta 1.019)
## ML-NNI round 3: LogLk = -5298.548 NNIs 61 max delta 1.02 Time 1.98 (final)
## Optimize all lengths: LogLk = -5298.540 Time 2.04
##       2.17 seconds: ML split tests for    100 of    214 internal splits
##       2.32 seconds: ML split tests for    200 of    214 internal splits
## Total time: 2.34 seconds Unique: 217/217 Bad splits: 0/214
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 10 rounds ME-NNIs, 2 rounds ME-SPRs, 5 rounds ML-NNIs
## Total branch-length 0.080 after 0.00 sec
## ML-NNI round 1: LogLk = -612.010 NNIs 1 max delta 0.13 Time 0.00
## GTR Frequencies: 0.2670 0.2510 0.2459 0.2361
## GTR rates(ac ag at cg ct gt) 0.1273 14.1848 2.8987 3.9092 3.0704 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.642 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -587.982 NNIs 0 max delta 0.00 Time 0.02
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -587.981 NNIs 0 max delta 0.00 Time 0.02 (final)
## Optimize all lengths: LogLk = -587.981 Time 0.02
## Total time: 0.03 seconds Unique: 6/8 Bad splits: 0/3
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.03 seconds
## Refining topology: 28 rounds ME-NNIs, 2 rounds ME-SPRs, 14 rounds ML-NNIs
##       0.12 seconds: SPR round   1 of   2, 201 of 234 nodes
##       0.24 seconds: SPR round   2 of   2, 201 of 234 nodes
## Total branch-length 1.596 after 0.28 sec
##       0.47 seconds: ML NNI round 1 of 14, 101 of 116 splits, 29 changes (max delta 5.999)
## ML-NNI round 1: LogLk = -4057.963 NNIs 35 max delta 6.00 Time 0.49
##       0.58 seconds: Optimizing GTR model, step 5 of 12
##       0.69 seconds: Optimizing GTR model, step 11 of 12
## GTR Frequencies: 0.2038 0.2859 0.3002 0.2101
## GTR rates(ac ag at cg ct gt) 1.0761 0.9774 0.5841 0.4792 1.2071 1.0000
##       0.79 seconds: Site likelihoods with rate category 18 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.789 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
##       0.89 seconds: ML NNI round 2 of 14, 101 of 116 splits, 15 changes (max delta 0.290)
## ML-NNI round 2: LogLk = -3817.538 NNIs 18 max delta 1.05 Time 0.92
## ML-NNI round 3: LogLk = -3814.285 NNIs 1 max delta 3.25 Time 0.97
## ML-NNI round 4: LogLk = -3814.285 NNIs 0 max delta 0.00 Time 1.02
## Turning off heuristics for final round of ML NNIs (converged)
##       1.01 seconds: ML NNI round 5 of 14, 1 of 116 splits
## ML-NNI round 5: LogLk = -3814.284 NNIs 0 max delta 0.00 Time 1.13 (final)
##       1.12 seconds: ML Lengths 1 of 116 splits
## Optimize all lengths: LogLk = -3814.284 Time 1.15
##       1.28 seconds: ML split tests for    100 of    115 internal splits
## Total time: 1.31 seconds Unique: 118/118 Bad splits: 0/115
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 13 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
## Total branch-length 0.104 after 0.00 sec
## ML-NNI round 1: LogLk = -699.140 NNIs 1 max delta 0.00 Time 0.01
## GTR Frequencies: 0.2026 0.3061 0.2592 0.2321
## GTR rates(ac ag at cg ct gt) 0.7128 4.0992 1.9319 0.9442 0.6308 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.648 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -675.314 NNIs 1 max delta 0.00 Time 0.03
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -675.314 NNIs 0 max delta 0.00 Time 0.03 (final)
## Optimize all lengths: LogLk = -675.314 Time 0.04
## Total time: 0.05 seconds Unique: 9/10 Bad splits: 0/6
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR74.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.01 seconds
## Refining topology: 25 rounds ME-NNIs, 2 rounds ME-SPRs, 12 rounds ML-NNIs
##       0.10 seconds: ME NNI round 17 of 25, 1 of 69 splits
## Total branch-length 0.623 after 0.11 sec
## ML-NNI round 1: LogLk = -2035.235 NNIs 32 max delta 1.30 Time 0.17
##       0.21 seconds: Optimizing GTR model, step 5 of 12
## GTR Frequencies: 0.2003 0.2812 0.3012 0.2172
## GTR rates(ac ag at cg ct gt) 1.3353 1.7821 0.7868 0.4971 1.3891 1.0000
##       0.31 seconds: Site likelihoods with rate category 14 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.772 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -1916.232 NNIs 8 max delta 0.00 Time 0.39
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -1916.232 NNIs 1 max delta 0.00 Time 0.45 (final)
##       0.45 seconds: ML Lengths 1 of 69 splits
## Optimize all lengths: LogLk = -1916.232 Time 0.47
## Total time: 0.57 seconds Unique: 71/71 Bad splits: 0/68
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR74_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 6 rounds ME-NNIs, 2 rounds ME-SPRs, 3 rounds ML-NNIs
## Total branch-length 0.053 after 0.00 sec
## ML-NNI round 1: LogLk = -560.888 NNIs 0 max delta 0.00 Time 0.00
## GTR Frequencies: 0.1960 0.3050 0.2620 0.2370
## GTR rates(ac ag at cg ct gt) 3.2025 7.6580 2.7938 2.4124 1.8012 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.635 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -548.791 NNIs 0 max delta 0.00 Time 0.00
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -548.791 NNIs 0 max delta 0.00 Time 0.00 (final)
## Optimize all lengths: LogLk = -548.791 Time 0.00
## Total time: 0.01 seconds Unique: 3/3 Bad splits: 0/0

9.6 Reading and plotting trees

Generated trees arre edited using treeio and plotted with ggtree. The trees were rerooted to their respectives Unmutated Common Ancestor (UCA), which in this case is defined as just the V and J gene germlines combined. The gap between V-J is inserted automatically by the alignment method, thus the CDR3 here is not considered for the UCA.

# function to root tree in UCA 
.to_root_uca <- function(x){
root(x,which(grepl("UCA",x[["tip.label"]])))
}

ls <- list.files(paste0("../",result.dir), pattern = "*\\.tre", full.names = TRUE)

names(ls) <- lapply(ls, function(x) {
  if (stringr::str_count(x, "LOR") > 1) {
    substr(x, 24, 34)
  }else if (grepl("LC",x)) {
    substr(x, 24, 31)
  }else if (grepl("LOR",x)) {
    substr(x, 24, 28)
  }else{
    substr(x, 24,33)
    }})
trees <- lapply(ls, read.tree)
trees_rerooted <- lapply(trees, .to_root_uca)
plots <- lapply(trees_rerooted, ggtree)

for(i in seq_along(plots)){
  print(i)
  plot_name <- names(plots)[i]
  
  plots_edit <- plots[[i]]$data %>%
    mutate(new_label = ifelse(grepl("sc_|LOR",label), gsub("sc_", "",label), ""),
           new_label = plyr::mapvalues(new_label,from = lor_mabs$well_ID, to = lor_mabs$LOR,
                                       warn_missing = FALSE),
           new_label = ifelse(grepl("LOR",new_label), new_label, ""),
           name = label,
           timepoint = case_when(grepl("B2-igg",name) ~ "B2",
                                 grepl("B1-igg",name) ~ "B1",
                                 grepl("sc_|LOR",name) ~ "single_cell",
                                 grepl("igm",name) ~ "PV",
                                 TRUE ~ "intersects"),
           name = gsub("sc_", "", label))
  
  
  if(stringr::str_count(plot_name, "LOR") > 1){
  plots_edit <- left_join(plots_edit, sc_seq_count[[paste0(plot_name,1)]], by = "name") 
  }else{ 
  plots_edit <- left_join(plots_edit, sc_seq_count[[plot_name]], by = "name") 
  }
 # shapes_timepoints <- c("PV" = 18, "B1" = 17, "B2" = 16, "single_cell" = 4)
  colors_timepoints <- c("PV" = "red", "intersects" = "black", "B1" = "#66c2a5", "B2" = "#fc8d62", "single_cell" = "#fc8d62")

  gg_plot <- ggtree(plots_edit,aes(color = timepoint)) + 
    {if(grepl("LC", plot_name))geom_tippoint(shape = 18, size = 1)
      else geom_tippoint(aes(size = duplicated), shape = 18)}+
   # geom_tippoint(aes(size = duplicated), shape = 23) +
    geom_tiplab(aes(label = new_label), hjust = -.2) +
    labs(size = "Count", shape = "Shape", color = "Timepoint") + 
    scale_color_manual(values = colors_timepoints) +
    coord_cartesian(clip = 'off') + 
   # scale_shape_manual(values = shapes_timepoints) +
    scale_size_area(limits = c(1,25), breaks = c(1,5,10,15,25), max_size = 3)+
    geom_treescale(width = .05)

  print(gg_plot)
  
   ggsave(gg_plot, filename = paste0("../", result.dir,  names(ls)[[i]], "_tree.pdf"), width = ifelse(grepl("LC", plot_name), 3, 4), height = ifelse(grepl("LC", plot_name), 3, 6))
  }
## [1] 1

## [1] 2

## [1] 3

## [1] 4

## [1] 5

## [1] 6

11 Profiling expressed monoclonal antibodies

11.1 Heatmap of competition between LOR mAbs

LOR mAbs had a diverse competition profiles against different standardized mAbs while binding to PostF and PreF RSV fusion proteins.

data_comp_auc$EC50_preF <- log10(data_comp_auc$EC50_preF)
data_comp_auc$EC50_postF <- log10(data_comp_auc$EC50_postF)
data_comp_auc$IC50_RSV_A2 <- log10(data_comp_auc$IC50_RSV_A2)

# change here the columns you want to remove from the heatmap
to_remove <- c("EC50_postF","EC50_preF", "epitope", "specificity", "IC50_RSV_A2")
annot_colors <- list(specificity = c("PreF" = "#F7D586",
                                     "PreF/PostF" = "#CD87F8",
                                     "PostF" = "#92CDD6",
                                     "w.b." = "grey20"), epitope = c(
                                                    "Ø" = "#F3B084",
                                                    "V" = "salmon", 
                                                    "III" = "#A9D08D",
                                                    "IV" = "#FAD0FF",
                                                    "II" = "#FFD966",
                                                    "I" = "#9BC1E6",
                                                    "UK-Pre" = "#C9C9C9",
                                                    "UK-DP" = "#8497B0" ,
                                                    "WB" = "grey95",
                                                    "PostF" = "black",
                                                    "foldon" = "grey40"))

{
  g_heatmap_scale <- data_comp_auc %>%
    select(-all_of(to_remove)) %>%
    mutate(across(.cols = everything(), .fns = function(x) pmax(x,0))) %>%
    t() %>%
    pheatmap::pheatmap(scale = "none", angle_col = 90, cutree_cols = 8, 
                     clustering_method = "ward.D", 
                     color = viridisLite::viridis(100), 
                     cluster_rows = FALSE, border_color = "grey40", 
                     cluster_cols = TRUE, 
                     legend_breaks = c(0, 0.5, 1, 1.5, max(.)),
                     legend_labels = c("0", "0.5", "1", "1.5", "Normalized\ncompetition\n"),
                     annotation_col = data_comp_auc[,13:14], 
                     annotation_colors = annot_colors, 
                     cellwidth = 5, cellheight = 5, fontsize_row = 5, fontsize_col = 6, fontsize = 6)
 
  ggsave(g_heatmap_scale,filename = paste0("../", result.dir, "/", "LOR_heatmap_auc_wardD.pdf"), width = 18, height = 12, units = "cm")
}

11.2 Multidimensional scaling of competition between LOR mAbs

This MDS is another way to visualize the same data showed on the heatmap above. Here, the MDS input is the euclidean distance. The data was scaled and centered prior calculating their euclidean distance, which is the most used and simplest distance calculation.

11.2.1 Colored by epitope specificity

# check which mabs should be included and added
to_habillage <- factor(rownames(data_comp_auc), levels = c(paste0("LOR", sprintf(fmt = "%02d",seq(1:106)))))

data_comp_auc <- data_comp_auc %>% tibble::rownames_to_column("name") 
to_remove <- c("epitope", "specificity", "IC50_RSV_A2")

# compute MDS
mds_scaled <- data_comp_auc %>%
  select(-all_of(c(to_remove, "name"))) %>%
  scale(center = TRUE, scale = TRUE) %>%
  dist(method = "euclidean") %>%          
  cmdscale() %>%
  as_tibble()

colnames(mds_scaled) <- c("Dim.1", "Dim.2")
# Plot MDS
mds_scaled$name <- to_habillage
  
p1 <- mds_scaled %>%
  left_join(data_comp_auc) %>%
  ggplot(aes(Dim.1,Dim.2, label = name, color = epitope)) +
  geom_point(size = 2) +
  ggrepel::geom_text_repel(max.overlaps = 5) +
  scale_color_manual(values = annot_colors[["epitope"]]) +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5),
        aspect.ratio = 1) 

plotly::ggplotly(p1)
ggsave(plot = p1, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-sites.pdf"))

11.2.2 Colored by RSV neutralization

p2 <- mds_scaled %>%
  left_join(data_comp_auc) %>%
  ggplot(aes(Dim.1,Dim.2, label = name, color = IC50_RSV_A2)) +
  geom_point(size = 2) +
  ggrepel::geom_text_repel(max.overlaps = 5) +
  scale_color_viridis_c() +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  labs(color = "IC50 RSV\n(log10)")+
  theme(axis.ticks = element_line(size = .5),
        aspect.ratio = 1) 

plotly::ggplotly(p2)
ggsave(plot = p2, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-neuts.pdf"))

11.3 Processing and merging LOR metadata

mabs_lors <- mabs_lors %>%
  filter(name %in% lor_mabs$well_ID) %>%
  mutate(mAbs_ID = plyr::mapvalues(name, from = lor_mabs$well_ID, to = lor_mabs$LOR)) %>%
  arrange(mAbs_ID) %>%
  relocate(mAbs_ID)

mabs_clonal_rank <- rds_summary
for(i in mabs_lors$name) {
  mabs_clonal_rank[[i]] <- ifelse(grepl(i, rds_summary$sc_clone_grp), 
                                  mabs_lors[mabs_lors$name == i,]$mAbs_ID, 
                                  NA)
}

col_combine <- colnames(mabs_clonal_rank[15:length(mabs_clonal_rank)])         
mabs_clonal_rank <- mabs_clonal_rank %>%  
  mutate(LOR = purrr::invoke(coalesce, across(all_of(col_combine)))) %>%
  select(LOR, colnames(mabs_clonal_rank)[! colnames(mabs_clonal_rank) %in% col_combine]) %>%
  filter(!is.na(LOR), database == "Individualized", Timepoint == "B2") %>%
  arrange(LOR) %>%
  mutate(well_ID = plyr::mapvalues(LOR, from = lor_mabs$LOR, to = substr(lor_mabs$well_ID, 1,3))) %>%
  filter(ID == well_ID)
  
mabs_lors <- mabs_lors %>% 
  mutate(clonal_rank_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size_rank),
         clonal_size_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size),
         clonal_group = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$grp)) %>%
  relocate(mAbs_ID, clonal_rank_B2, clonal_size_B2, clonal_group)

# Fixed manually LOR24 (same as LOR19), LOR37 (not detected B2, clonal size sc = 1), LOR40 (not detected B2, clonal size sc = 1 ), LOR42 (same clone as LOR01), LOR94 (same clone as LOR01), LOR96 (not detected B2, clonal size sc = 4)

#mabs_lors %>% write.csv(paste0("../data/single_cell/LOR_mAbs_info.csv"), row.names = F)
mabs_lors <- read.csv(file = paste0("../data/single_cell/LOR_mAbs_info.csv"), sep = ",")

mabs_lors %>%
  left_join(data_comp_auc %>% tibble::rowid_to_column("mAbs_ID") %>% mutate(mAbs_ID = as.character(name)), by = "mAbs_ID") %>%
ggplot(aes(x = clonal_rank_B2, y = clonal_size_B2, color = epitope)) +
  geom_point(size = 3) +
  scale_color_manual(values = annot_colors[["epitope"]]) +
  scale_y_log10() +
  scale_x_log10() +
  labs(fill = "Animal ID", x = "Clonal rank\n(log10)", y = "Clonal size\n(log10)") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

11.4 Heavy and light chain paired V genes

merged_mabs_lors <- mabs_lors %>%
  left_join(clonotype_rsv, by = "name") %>%
  left_join(clono_light_chains,suffix = c("_HC","_LC"), by = "name") %>%
  mutate(v_call_HC = V_gene.x) %>%
  fill(v_call_LC)

write.csv(merged_mabs_lors, file = "../data/single_cell/LOR_mAbs_info_full.csv", row.names = F)

merged_mabs_lors %>%
  group_by(v_call_HC, v_call_LC) %>%
  mutate(mabs_counts = n()) %>%
  ggplot(aes(x= v_call_HC, y = v_call_LC, fill = mabs_counts)) +
    geom_tile(color = "black") +
    scale_fill_viridis_c(option = "viridis", direction = 1) +
    scale_y_discrete(position = "right") +
    labs(fill = "mAb counts", x = "IGHV alleles", y = "IGHJ alleles") + 
    coord_equal() +
    ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
     theme(axis.text.x = element_text(angle = 90, size = 6, hjust = 1, 
                                     vjust = 0.5, face = "bold", colour = "black"),
          axis.text.y = element_text(face = "bold", colour = "black", size = 6),
          strip.text = element_text(face = "bold"),
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          legend.position =  "right",
          axis.ticks = element_line(size = .5),
          legend.title = element_text())

12 Take environment snapshot

renv::snapshot()

13 SessionInfo

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS/LAPACK: /Users/rodrigoarcoverde/opt/anaconda3/envs/rsv_np_repertoire/lib/libopenblasp-r0.3.21.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4    dplyr_1.1.0         ggprism_1.0.4      
##  [4] scales_1.2.1        vegan_2.6-4         lattice_0.20-45    
##  [7] permute_0.9-7       data.table_1.14.8   Biostrings_2.66.0  
## [10] GenomeInfoDb_1.34.8 XVector_0.38.0      IRanges_2.32.0     
## [13] S4Vectors_0.36.0    BiocGenerics_0.44.0 ggpubr_0.6.0       
## [16] ggplot2_3.4.1       rstatix_0.7.2       ggtree_3.6.0       
## [19] treeio_1.22.0       tidyr_1.3.0         jsonlite_1.8.4     
## 
## loaded via a namespace (and not attached):
##  [1] ggVennDiagram_1.2.2    colorspace_2.1-0       ggsignif_0.6.4        
##  [4] ellipsis_0.3.2         class_7.3-21           aplot_0.1.10          
##  [7] rstudioapi_0.14        proxy_0.4-27           farver_2.1.1          
## [10] ggrepel_0.9.3          fansi_1.0.4            xml2_1.3.3            
## [13] splines_4.2.2          cachem_1.0.7           knitr_1.42            
## [16] broom_1.0.4            cluster_2.1.4          pheatmap_1.0.12       
## [19] compiler_4.2.2         httr_1.4.5             backports_1.4.1       
## [22] Matrix_1.5-3           fastmap_1.1.1          lazyeval_0.2.2        
## [25] cli_3.6.0              htmltools_0.5.4        tools_4.2.2           
## [28] gtable_0.3.1           glue_1.6.2             GenomeInfoDbData_1.2.9
## [31] Rcpp_1.0.10            carData_3.0-5          jquerylib_0.1.4       
## [34] vctrs_0.5.2            ape_5.7                svglite_2.1.1         
## [37] nlme_3.1-162           crosstalk_1.2.0        xfun_0.37             
## [40] stringr_1.5.0          rvest_1.0.3            lifecycle_1.0.3       
## [43] zlibbioc_1.44.0        MASS_7.3-58.3          parallel_4.2.2        
## [46] RColorBrewer_1.1-3     yaml_2.3.7             gridExtra_2.3         
## [49] ggfun_0.0.9            yulab.utils_0.0.6      sass_0.4.5            
## [52] stringi_1.7.12         highr_0.10             tidytree_0.4.2        
## [55] e1071_1.7-13           rlang_1.0.6            pkgconfig_2.0.3       
## [58] systemfonts_1.0.4      bitops_1.0-7           evaluate_0.20         
## [61] purrr_1.0.1            sf_1.0-7               patchwork_1.1.2       
## [64] htmlwidgets_1.6.1      labeling_0.4.2         cowplot_1.1.1         
## [67] tidyselect_1.2.0       plyr_1.8.8             magrittr_2.0.3        
## [70] R6_2.5.1               generics_0.1.3         DBI_1.1.3             
## [73] pillar_1.8.1           withr_2.5.0            mgcv_1.8-42           
## [76] units_0.8-1            abind_1.4-5            RCurl_1.98-1.10       
## [79] tibble_3.2.0           crayon_1.5.2           car_3.1-1             
## [82] KernSmooth_2.23-20     utf8_1.2.3             plotly_4.10.1         
## [85] RVenn_1.1.0            rmarkdown_2.20         grid_4.2.2            
## [88] digest_0.6.31          classInt_0.4-9         webshot_0.5.4         
## [91] gridGraphics_0.5-1     munsell_0.5.0          viridisLite_0.4.1     
## [94] ggplotify_0.1.0        bslib_0.4.2